Abstract

Many  nonparametric  regression  techniques  (such  as  kernels,  nearest  neighbors,  and  smoothing  splines) 
estimate  the  conditional  mean  of  Y  given  X  =  x  by  a  weighted  sum  of  observed  Y  values,  where  observations 
with  X  values  near  x  tend  to  have  larger  weights.  In  this  report  the  weights  are  taken  to  represent  a  finite 
signed  measure  on  the  space  of  Y  values.  This  measure  is  studied  as  an  estimate  of  the  conditional  distribution 
of Y given X = x. From estimates of the conditional distribution, estimates of conditional means, standard
deviations,  quantiles  and  other  statistical  functionals  may  be  computed. 

Chapter  1  illustrates  the  computation  of  conditional  quantiles  and  conditional  survival  probabilities  on 
the  Stanford  Heart  Transplant  data.  Chapter  2  contains  a  survey  of  nonparametric  regression  methods  and 
introduces  statistical  metrics  and  von  Mises’  method  for  later  use. 

Chapter  3  proves  some  consistency  results.  The  estimated  conditional  distribution  of  Y  is  shown  to 
be  consistent  in  the  following  sense:  the  Prohorov  distance  between  the  estimated  and  true  conditional 
distributions  converges  in  probability  to  zero.  The  required  conditions  are:  that  the  distribution  of  Y 
given X = x vary continuously with x, that the weights regarded as a measure on the X space converge
in  probability  to  a  point  mass  at  x,  and  that  a  measure  of  the  effective  local  sample  size  tend  to  infinity 
in  probability.  A  slight  strengthening  of  the  conditions  allows  one  to  establish  almost  sure  consistency. 
Consistency  of  Prohorov-continuous  (i.e.  robust)  functionals  follows  immediately.  In  the  above,  the  X  and 
Y  spaces  are  complete  separable  metric  spaces.  In  case  Y  is  the  real  line,  weak  and  strong  consistency  results 
are  established  for  the  Kolmogorov-Smirnov  and  the  Vasserstein  metrics  under  stronger  conditions. 

Chapter  4  provides  conditions  under  which  the  suitably  normalized  errors  in  estimating  the  conditional 
distribution  of  Y  have  a  Brownian  limit.  Using  von  Mises’  method,  asymptotic  normality  is  obtained  for 
nonparametric  conditional  estimates  of  compactly  differentiable  statistical  functionals. 
1 Introduction

This report is concerned with estimation of aspects of the conditional distribution of a
random variable Y given another random variable X.

The  most  familiar  example  is  the  estimation  of  the  conditional  expectation  of  Y  given 
X = x. When this is carried out for a large number of x’s the results can be presented as
a  curve.  The  curve  is  usually  plotted  together  with  the  data  used  to  estimate  it.  It  then 
may  be  used  in  informal  data  analysis,  or  its  shape  may  be  used  to  select  or  confirm  a 
parametric  model,  or  finally  it  may  be  used  for  the  prediction  of  Y  values  corresponding 
to  future  X  values. 

No  serious  analysis  of  a  single  sample  of  data  would  stop  at  reporting  the  sample 
mean.  Similarly  in  the  bivariate  case  there  is  a  need  to  go  beyond  the  examination  of 
the  estimated  conditional  mean.  Estimating  conditional  standard  deviations  is  a  natural 
first  step  in  this  direction.  For  the  data  analyst,  a  plot  with  a  running  mean  and  with 
curves  equal  to  the  running  mean  plus  or  minus  two  running  standard  deviations  would 
be  useful  in  assessing  whether  the  data  are  heteroscedastic.  If  they  are,  such  a  plot  would 
show where and by how much the variation fluctuates. Much has been written about
how  hard  it  is  to  perceive  a  conditional  mean  in  a  scatterplot  without  the  presence  of  an 
estimating  curve.  Surely  the  same  is  true  about  the  perception  of  conditional  variance 
or  skewness.  Where  conditional  variances  are  equal,  they  can  seem  to  be  larger  where 
the  marginal  distribution  of  X  is  greater,  because  visual  impressions  are  dominated  by 
extreme  observations.  For  prediction,  an  estimate  of  the  conditional  scale  of  Y  would 
seem  essential  in  order  to  provide  an  interval  about  the  prediction. 

Often in one-sample situations the mean and standard deviation are not the most
convenient way to study the data. For example in survival analysis the mean of the
failure  times  is  difficult  to  assess  in  the  presence  of  censoring  but  the  median  and  other 
quantiles  are  readily  obtained.  Where  outliers  are  suspected  the  mean  is  often  replaced 
by  a  trimmed  mean  or  some  other  robust  estimator.  Low  quantiles  are  the  natural  choice 
when  one  studies  breaking  strengths  of  materials.  In  bivariate  situations  where  the  Y 
values  are  subject  to  censoring  or  outliers,  or  in  which  extreme  Y  quantiles  are  of  interest 
it  is  natural  to  compute  running  quantiles  or  robust  estimators  instead  of  a  regression 
curve. 

We  suppose  that  the  estimation  is  performed  in  two  stages.  First  at  each  point  x  in 
a  grid,  the  conditional  distribution  of  Y  given  x  is  estimated.  Then  at  each  grid  point 
a  function  that  takes  distributions  and  returns  means,  variances,  quantiles  or  whatever 
is  applied.  The  results  are  plotted  against  the  grid  points  and  joined  up  to  provide  the 
estimate of the curve. The distribution estimators considered are all nonparametric and
discrete. They are reweightings of the Y sample adapted from weights used in nonparametric
regression techniques.
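
The two-stage procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the report's implementation: the function names, the bandwidth-style triangular weights used for the first stage, and the toy data are all introduced here. The weighted quantile assumes nonnegative weights that sum to one.

```python
def triangular_weights(xs, x0, h):
    """Stage 1 (assumed scheme): weights proportional to (1 - |x_i - x0|/h)_+, summing to 1."""
    raw = [max(0.0, 1.0 - abs(xi - x0) / h) for xi in xs]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no observations within distance h of x0")
    return [r / total for r in raw]


def weighted_mean(ys, ws):
    """Mean of the discrete distribution putting weight ws[i] on ys[i]."""
    return sum(w * y for w, y in zip(ws, ys))


def weighted_quantile(ys, ws, p):
    """Smallest y whose cumulative weight reaches p (nonnegative weights summing to 1)."""
    pairs = sorted(zip(ys, ws))
    cum = 0.0
    for y, w in pairs:
        cum += w
        if cum >= p - 1e-12:
            return y
    return pairs[-1][0]


def running_functional(xs, ys, grid, h, functional):
    """Stage 2: apply functional(ys, weights) to the reweighted Y sample at each grid point."""
    return [functional(ys, triangular_weights(xs, x0, h)) for x0 in grid]


# Toy data (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]
grid = [2.0, 3.0, 4.0]

means = running_functional(xs, ys, grid, h=2.0, functional=weighted_mean)
medians = running_functional(
    xs, ys, grid, h=2.0, functional=lambda y, w: weighted_quantile(y, w, 0.5)
)
```

Because the functional is passed in as an argument, the same reweighting serves for means, quantiles, or any other statistic of a discrete distribution.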

Figure  1  (page  7)  presents  the  Stanford  Heart  Transplant  data.  The  horizontal  axis 
is  the  age  at  entry  to  the  transplant  program  of  a  patient.  The  vertical  axis  is  the  base  10 
logarithm  of  the  number  of  days  the  patient  was  observed  to  survive  after  the  operation. 
There  are  157  data  points.  Points  marked  “X”  represent  times  of  death  and  points  marked 
“+”  represent  censoring  times.  All  that  is  known  about  the  time  of  death  for  a  censored 
patient  is  that  it  exceeds  the  time  recorded. 

Other  variables  were  recorded,  but  the  survival  time  is  of  primary  interest  and  the 
age  at  entry  is  the  most  useful  predictor  of  it.  The  most  notable  feature  of  this  data  is 
the  drop-off  in  survival  at  entry  ages  over  50  years.  This  feature  is  hard  to  see  in  the  raw 
data,  especially  because  of  the  censoring. 

The  observed  ages  were  used  to  form  a  grid  and  at  each  such  age  a  reweighting  of  the 
ordered pairs (survival time, censoring indicator) was obtained. (The weights were based
on symmetric triangular nearest neighbors. See Sec. 2.2. The k = 23 nearest neighbors
on  each  side  of  the  target  point  were  used.)  Because  interest  centers  on  the  distribution 
of  survival  times,  the  censoring  is  a  nuisance.  It  is  usually  handled  by  calculating  the 
product-limit  estimator  of  the  survival  function.  A  convenient  way  to  do  this  for  weighted 
distributions  is  via  the  “redistribute  to  the  right”  algorithm  of  Efron  (1967). 

In  Figure  2  (page  8)  there  are  5  estimated  conditional  survival  quantiles  corresponding 
to  levels  (.1,.3,.5,.7,.9).  The  quantile  curve  labelled  .7  represents  an  estimate  of  the  (log) 
time  at  which  70%  of  patients  will  still  be  alive  given  their  respective  ages  at  entry.  Some 
of  the  survival  quantiles  are  themselves  censored.  For  example,  the  time  at  which  only 
10%  of  25  year  olds  will  remain  is  censored.  This  is  because  there  was  more  than  10% 
censoring  in  the  data  used  to  estimate  the  survival  time  given  an  age  of  25  years  at  entry. 

The sharp drop in the median survival time is also evident in the 70% survival curve
and to a lesser extent in the other survival deciles.

Another way to look at the ensemble of estimated survival probabilities is to estimate
for each x, the conditional probability of survival past a certain time. Figure 3 (page 9)
contains a plot of such curves for the probability of survival past 10, 100, and 1000 days.
Also  plotted  are  the  probabilities  of  surviving  some  interpolated  times,  roughly  3,  32  and 
316  days.  (The  estimated  3  day  survival  probability  is  1  for  older  patients,  so  that  curve 
disappears  at  the  right  of  the  figure.)  The  probability  of  survival  past  100  days  drops 
sharply  at  the  age  of  50.  So  does  the  probability  of  survival  past  1000  days  and  the  curves 
are  roughly  parallel.  The  probability  of  survival  past  10  days  differs  markedly  from  the 
curves  for  longer  survival  times — it  is  almost  flat. 

The  sort  of  calculations  illustrated  on  this  data  are  similar  to  those  that  a  data 
analyst  might  make  on  a  univariate  sample.  The  next  natural  step  might  be  to  compute 
conditional  hazard  functions  and  plot  a  hazard  surface,  using  age  at  time  of  entry  and 
days  since  the  operation  as  coordinates.  One  might  also  apply  Greenwood’s  formula 
conditionally to estimate conditional standard deviations of the probabilities in Figure
3. Approximate confidence intervals for conditional probabilities could be used to obtain
confidence intervals for conditional quantiles. Any functional that a statistician applies to
a sample might, in the bivariate case, be applied conditionally on X.

Methods  like  these  will  be  analyzed  by  considering  separately  the  properties  of  the 
functional and the distribution estimator. This approach has certain economies: for example,
if the distribution estimator is suitably consistent then so are running versions of any
functional  that  is  robust  at  the  underlying  distributions  of  Y  given  x.  There  is  no  need 
for  specific  investigation  of  the  functional  beyond  that  needed  to  establish  its  robustness. 

The probability model to be used is defined in Chapter 2. It incorporates i.i.d. sampling
of (X, Y) pairs and designed sequences of X’s. The notation is introduced along
with the exposition of the model. Chapter 2 continues with examples of nonparametric
regression techniques that can be made into estimates of the conditional distribution of
Y. Some background material concerning statistical functionals, metrics on spaces of distributions,
bivariate probability models and compact differentiability is given in Chapter
2. Some lemmas are presented in Chapter 2, for later use. Nonparametric regression
techniques are predicated on an assumption that when X is near x, the conditional mean
of Y is close to its value at x. Usually one can assume more: when X is near x, the
distribution of Y is near to the distribution it takes at x. In Chapter 2 this idea is made
precise by placing a metric on the distributions of Y and assuming that the conditional
distributions under this metric are continuous in x.

Chapter  3  studies  consistency.  Sufficient  conditions  for  pointwise  weak  and  strong 
consistency  of  the  estimated  conditional  distribution  of  Y  are  given.  Consistency  in  three 
of  the  statistical  metrics  (Prohorov,  Kolmogorov-Smirnov  and  Vasserstein)  from  Chapter 
2  is  obtained.  Consistency  of  running  functionals  then  follows  for  continuous  functionals. 

Chapter  4  studies  asymptotic  normality.  First,  asymptotic  normality  of  the  regression 
function is developed. This extends to the finite dimensional distributions of the conditional
empirical process. A functional central limit theorem is then proved. Asymptotic
normality conditions for the regression may be translated into conditions for running versions
of some functionals. The class of compactly differentiable functionals is considered.
Using von Mises’ method the running functional is decomposed into the sum of a regression
function and a remainder term. Sufficient conditions for the remainder term to be
asymptotically  negligible  are  provided. 

The  scope  is  limited  as  follows:  pointwise  (not  uniform  in  x)  results  are  obtained, 
problems  of  bandwidth  selection  are  not  considered,  and  rates  of  convergence  are  not 
computed.  These  represent  three  worthwhile  directions  for  extension;  perhaps  a  good 
starting point for each might be based on the way these extensions are made for regressions.
Pointwise results (that hold a.e.) are stronger than global results but not as strong
as uniform results. The pointwise approach handles designed as well as sampled predictors
whereas global results usually assume i.i.d. sampling of predictor-response pairs. (These
distinctions are discussed in Chapter 3.) Bandwidth selection might be tuned to some
loss function on distributions or to a particular functional such as the mean. It should
be reasonable to select a bandwidth for regression estimation and use it in the associated
distribution estimator. Whether one might do better by a direct method is an interesting
issue but depends on the loss function imposed on estimates of the distribution. In
nonparametric regression the attainable rate of convergence depends on the number of
continuous derivatives that the regression function admits. Similar results might be expected
to hold for estimators of conditional distribution functions. The models considered
here  do  not  go  beyond  continuity  (or  Lipschitz  continuity)  of  the  Y  distributions  as  a 
function  of  x.  With  extensions  to  differentiability,  it  would  be  profitable  to  consider  rates. 

In  developing  theorems  and  notation,  emphasis  was  placed  on  getting  theorems  that 
applied  broadly,  with  conditions  and  conclusions  that  are  easy  to  interpret.  Theorems  3.2.2 
and  3.3.1  are  the  most  successful  in  this  regard.  While  minimal  assumptions  are  placed  on 
the estimators of the conditional distribution, there is more structure than usual placed
on the conditional distributions of Y given x. In particular, some form of continuity is
always assumed. The opposite approach is to place (almost) no conditions on the data and
impose whatever conditions on the method yield optimal results. This is appropriate when
one knows very little about the data because the statistician has complete control over
the method. It is especially reasonable when there is a bona fide loss function to which
the optimal asymptotic result applies. However, when one is reasonably sure that some
structure  is  present,  and  has  reasons  unrelated  to  asymptotics  for  choosing  one  estimator 
over  another,  then  broadly  applicable  results  that  exploit  some  structure  are  of  value. 
Also,  broad  conditions  can  expose  similarities  between  apparently  different  methods. 

The approach taken here (discrete estimation of the conditional distribution followed
by the application of a functional) is taken from Stone (1977), who uses it to obtain global
$L^p$ consistency for nearest neighbor regressions, quantile estimates and conditional Bayes
rules. In his discussion of Stone’s paper, Brillinger (1977) suggests the application of likelihood
functionals to the nonparametrically estimated conditional distributions. Brillinger
also suggested extensions to conditional M-estimates which would have advantages of robustness.
In his rejoinder Stone proves global weak consistency of the conditional estimate
by exploiting its continuity with respect to the Prohorov metric.

Cleveland  (1979)  uses  running  versions  of  conditional  M-estimators.  Tibshirani  (1984) 
considers local estimation of likelihood-based models. Härdle and Gasser (1984) establish
consistency and asymptotic normality of some conditional M-estimators. Stute (1986)
obtains a functional central limit theorem for a nearest neighbor estimator. Some other
references to results in the literature are made in Chapters 2, 3 and 4.

Conditional  medians  were  considered  for  the  heart  transplant  data  by  Doksum  and 
Yandell  (1983).  Tibshirani  (1984)  computes  local  proportional  hazards  models  for  this 
data.  Segal  (1986)  develops  a  rank-based  version  of  the  regression  trees  methodology  of 
Breiman et al. (1984) that can be applied to censored data. In particular he applies it
to  the  heart  transplant  data  and  finds  that  the  first  split  is  made  on  the  age  variable  at 
an  age  of  50. 
Figure 1 Stanford Heart Transplant Data.

The horizontal axis is the age of a patient on the date of entry to the transplant
program. The vertical axis is the logarithm (base 10) of the number of days the patient
was observed to survive after the operation. There are 157 data points. Points marked
“X” represent times of death and points marked “+” represent censoring times.
Figure 2 Odd Survival Deciles.

The axes are as in Figure 1. The curve labelled “.5” is an estimate of the conditional
median log survival time of a heart transplant patient, given the patient’s age at entry.
The other curves correspond to the estimated log times to which 10%, 30%, 70% and 90%
of patients will survive given their age at entry. Portions of the 10% and 30% curves are
censored. For example, the time at which only 10% of 25 year olds will remain is censored
because there was more than 10% censoring in the data used to estimate the survival time
given an age of 25 years at entry.
Figure 3 Survival Probabilities

The horizontal axis gives the age at entry of a patient to the Stanford Heart Transplant
program. The vertical axis gives the estimated conditional probability of survival past 10,
100, and 1000 days, given the age at entry. Also plotted are the probabilities of surviving
some interpolated times, roughly 3, 32 and 316 days. (The estimated 3 day survival
probability is 1 for older patients, so that curve disappears at the right of the figure.)


2  Preliminaries 


This  chapter  introduces  the  notation  used  throughout,  and  provides  some  examples  of 
estimators for conditional distributions. It also includes a discussion of statistical functionals,
of metrics on distributions, of models for conditional distributions and of von
Mises’  method  and  compact  differentiability  of  statistical  functionals. 

2.1  Notation 

The data consist of pairs $(X_i, Y_i)$ where $i = 1, \ldots, n$. The $X_i$ take values in a set $\mathcal{X}$ and
are thought of as predictors. The response variable $Y_i$ is a member of the set $\mathcal{Y}$. Unless
otherwise specified $\mathcal{X} \subset \mathbb{R}$ and $\mathcal{Y} = \mathbb{R}$ and both are endowed with the usual Euclidean
distance and topology. X and Y are used as typical data values that do not necessarily
correspond to any specific observation.

Interest  centers  on  the  conditional  behavior  of  Y  given  X.  To  this  end  it  is  convenient 
to  consider 

$$F_x(y) = P(Y \le y \mid X = x) \qquad (1)$$

which for fixed $x \in \mathcal{X}$ is a distribution function on $\mathcal{Y}$ and for fixed $y \in \mathcal{Y}$ is a function on
$\mathcal{X}$. Given that $X_i = x_i$, $Y_i$ has distribution function $F_{x_i}$. The random distribution $F_{X_i}$ is
equal to $F_{x_i}$ when $X_i = x_i$.

$F_\cdot$ represents the mapping $x \to F_x$ from $\mathcal{X}$ to the space of distributions on $\mathcal{Y}$. Regularity
conditions about the behavior of the distribution of Y for varying X will be expressed
in terms of $F_\cdot$. This will generally mean that $F_\cdot$ is continuous or Lipschitz continuous
when the distributions on $\mathcal{Y}$ are given an appropriate metric.
The probability model for the data is as follows: the X’s are drawn according to a
design  measure  (that  does  not  depend  on  the  Y  values),  and  the  Y’s  are  drawn  from  the 
corresponding  conditional  distributions  and  are  conditionally  independent  given  the  X’s. 

The  design  measure  for  the  X’s  could  be  a  prescribed  design  sequence  (design  case) 
or  it  could  be  i.i.d.  sampling  from  some  distribution  on  X  (sampling  case)  or  it  could 
be  more  complicated  involving,  say,  a  randomized  choice  among  designs  or  dependent 
sampling  that  tends  to  fill  in  gaps  left  in  X  by  the  previous  observations.  The  stipulation 
that  the  design  measure  not  depend  on  the  Y  rules  out  some  sequential  methods  that 
might  be  of  value. 

A  convenient  construction  to  describe  the  conditional  independence  of  the  Yi  given 
the $X_i$ is obtained as follows: introduce i.i.d. standard uniform random variables $U_i$, $i =
1, \ldots, n$ that are independent of the X’s, then put

$$Y_i = F_{X_i}^{-1}(U_i).$$

We define inverses of distribution functions as follows: $G^{-1}(u) = \inf\{y : G(y) \ge u\}$ and
$G^{-1}\{u\} = \{y : G(y) = u\}$.

For some fixed point $x \in \mathcal{X}$ let

$$Y_i^x = F_x^{-1}(U_i). \qquad (2)$$

Then $Y_i^x$, $i = 1, \ldots, n$ are i.i.d. random variables with distribution $F_x$. Intuitively, $Y_i^x$
is what $Y_i$ would have been if $X_i$ had been $x$. This construction will be handy in
bias-variance-like decompositions.
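
The coupling through common uniform variables can be illustrated with a short Python sketch. The conditional model used here (Y exponential with mean 1 + x given X = x), the design on [0, 4], and the names `F_inv`, `x0`, `Y_x` are assumptions chosen only so that the inverse distribution function has a closed form.

```python
import math
import random


def F_inv(x, u):
    """Inverse d.f. of the assumed model: given X = x, Y is exponential with mean 1 + x."""
    return -(1.0 + x) * math.log(1.0 - u)


random.seed(0)
n = 5
x0 = 2.0                                         # the fixed target point x
X = [random.uniform(0.0, 4.0) for _ in range(n)]
U = [random.random() for _ in range(n)]          # i.i.d. uniforms, independent of the X's

Y = [F_inv(xi, ui) for xi, ui in zip(X, U)]      # observed responses, built from F at X_i
Y_x = [F_inv(x0, ui) for ui in U]                # what each response would be at X_i = x0
```

Each pair (`Y[i]`, `Y_x[i]`) shares the same uniform draw, so the two differ only through the difference between the conditional distributions at `X[i]` and at `x0`.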

The  focus  of  interest  will  often  be  one  or  more  functionals 

$$T(\mathcal{L}(Y \mid X = x)) = T(F_x);$$

commonly considered functionals are the mean, mode, median, other quantiles, M-estimates,
m.l.e.’s and variance estimates of the aforementioned. $T(F_x)$ can be thought of as
a function on $\mathcal{X}$ as $x$ varies. The regression function arises for $T(\cdot) = m(\cdot)$, where

$$m(F_x) = \int y \, dF_x(y) \qquad (3)$$

is the mean. It should cause no confusion to use $m(x)$ for $m(F_x)$.

To analyze $T(\cdot)$, consider it as a mapping. Its domain $D_T$ must naturally contain $F_x$
for all $x \in \mathcal{X}$. It will also have to contain estimates of $F_x$ obtained from the data. These
will be distributions with support in a finite set. Unless otherwise stated the range of
$T$ is $\mathbb{R}$. The domain $D_T$ also comes equipped with a topology. Most of the topologies
considered are metrizable. The basic open sets in a metrized topological space can be
taken to be the open balls

$$B_\epsilon(F) = \{G \in D_T \mid \rho(F, G) < \epsilon\}$$

where $\epsilon > 0$ and $F \in D_T$, and $\rho(\cdot, \cdot)$ is a metric on $D_T$. The one non-metrizable topology
considered  is  the  topology  of  weak  convergence  for  finite  signed  measures  used  in  Sec.  3.2. 

The  emphasis  will  be  on  the  Kolmogorov-Smirnov  metric,  the  Prohorov  metric,  and 
the  Vasserstein  metrics.  See  Sec.  2.4  for  a  discussion  of  statistical  metrics. 

The running functional $T(F_x)$ is estimated by $T(\hat F_x)$ where $\hat F_x$ is an estimate of $F_x$
based on the data. $\hat F_x$ will depend on $n$ and $(X_i, Y_i)$, $i = 1, \ldots, n$ although this dependence
is suppressed for notational convenience. $\hat F_x$ is not in general a statistical functional by
virtue of its dependence upon $n$, but may be thought of as a sequence of such functionals.
$\hat F_x$ need not be a probability measure on $\mathcal{Y}$ in which case it may be necessary to extend
the definition of $T(\cdot)$.

Following Stone (1977) consider estimators $\hat F_x$ of the form

$$\hat F_x = \sum_{i=1}^{n} W_{ni}(x) \, \delta_{Y_i}$$

where $\delta_{Y_i} = 1_{Y_i \le y}$ is a point-mass at $Y_i$ and $W_{ni}(x)$ is a weight attached to the $i$’th
observation out of the first $n$ observations. $W_{ni}(x)$ depends on $X_1, \ldots, X_n$ and on $n$ but
not on the Y values. To keep the notation uncluttered, denote the weight on the $i$’th
observation by $W_i$ with $n$ and the target point $x$ understood. That is

$$\hat F_x = \sum W_i \, \delta_{Y_i} \qquad (4)$$

in terse notation. It is natural to denote $m(\hat F_x)$ by $\hat m(x)$.

The weights form a discrete signed measure on $\mathcal{X}$ with atoms of size $W_i$ at $X_i$. This
measure is denoted $W_x$, so that

$$W_x(A) = \sum W_i \, 1_{X_i \in A}. \qquad (5)$$

Many conditions on the weights can be expressed in terms of $W_x$. For large $n$, $W_x$ should be
close to $\delta_x$, the point-mass at $x$. That notion is made precise by putting a metric $\rho$ on the
distributions on $\mathcal{X}$ and requiring $\rho(W_x, \delta_x) \to 0$ in some mode of stochastic convergence.
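
The same weights can be read in two ways: as a measure on the Y values (the estimated conditional distribution) or as a measure on the X values. A minimal Python sketch of both readings follows. The function names and the toy data are hypothetical, and one weight is deliberately negative to show that a signed measure is allowed.

```python
def F_hat(ys, ws, y):
    """Estimated conditional d.f. at y: the total weight on observations with Y_i <= y."""
    return sum(w for yi, w in zip(ys, ws) if yi <= y)


def m_hat(ys, ws):
    """Plug-in mean of the estimated distribution: the weighted sum of the Y_i."""
    return sum(w * yi for yi, w in zip(ys, ws))


def W_measure(xs, ws, a, b):
    """Weight measure of the interval A = [a, b]: the total weight on the X_i in A."""
    return sum(w for xi, w in zip(xs, ws) if a <= xi <= b)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [5.0, 7.0, 6.0, 9.0]
ws = [-0.1, 0.4, 0.5, 0.2]   # hypothetical signed weights, summing to 1

cdf_at_6 = F_hat(ys, ws, 6.0)            # weight on Y_i <= 6
mean_estimate = m_hat(ys, ws)
near_mass = W_measure(xs, ws, 0.5, 2.5)  # weight on X_i in [0.5, 2.5]
```

With signed weights `F_hat` need not be monotone in y, which is why the definition of a functional may have to be extended beyond probability measures.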

For the regression function $m(\hat F_x) = \sum W_i Y_i$ and (4) incorporates many of the commonly
used nonparametric regression techniques including smoothing splines, kernel estimators,
nearest neighbour estimators, and running linear regressions. Sec. 2.2 discusses
the choice of the $W_i$ in more detail. These weights are distinguished from adaptive weights
which depend on the Y’s. For example, if the smoothing parameter in spline regression
or running linear regression is determined by cross-validation the resulting regression estimate
is adaptive and hence not covered by (4).

For a given set of weights put

$$n_x = \left( \sum W_i^2 \right)^{-1}. \qquad (6)$$

If each $F_{x_i}$ has variance $\sigma^2$, then conditionally on the observed $X_i$, $\sum W_i Y_i$ has variance
$\sigma^2 / n_x$. In this sense $n_x$ is an effective sample size at $x$. The $X_i$ that contribute to $\hat F_x$ are
thought of as a sample of size $n_x$ from $W_x$ and the locally reweighted $Y_i$ are thought of as
a biased sample of size $n_x$ from $F_x$. In asymptotic considerations, it will be necessary for
$n_x \to \infty$ to control the variance. Typically $n_x / n \to 0$ as $n \to \infty$ and this allows $W_x$ to
converge to $\delta_x$ to control the bias. For a fixed sample, $n_x$ regarded as a function of $x$ can
be used to compare precision over the range of X. It can also be used in heuristic degree
of freedom calculations for pointwise t-tests and intervals.
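
As a small numerical illustration, take the effective sample size to be the reciprocal of the sum of squared weights, which is what the variance statement above implies when the responses have a common variance. The helper name and the two weight vectors below are hypothetical.

```python
def effective_sample_size(ws):
    """Reciprocal of the sum of squared weights."""
    return 1.0 / sum(w * w for w in ws)


n, k = 10, 4

# Uniform weights on k of the n observations, zero elsewhere.
uniform_knn = [1.0 / k] * k + [0.0] * (n - k)

# Weights proportional to 3, 2, 1 on three observations, normalized to sum to 1.
raw = [3.0, 2.0, 1.0]
tapered = [r / sum(raw) for r in raw] + [0.0] * (n - len(raw))

n_uniform = effective_sample_size(uniform_knn)
n_tapered = effective_sample_size(tapered)
```

Uniform weights on k points give an effective sample size of exactly k. Unequal weights on three points give less than 3 (here 36/14), reflecting the extra variance from concentrating weight on fewer observations.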

Most  consistency  proofs  for  nonparametric  regressions  start  with  a  decomposition 

$$\hat m(x) - m(x) = \sum W_i (Y_i - m(X_i)) + \sum W_i (m(X_i) - m(x)) - m(x) \left( 1 - \sum W_i \right).$$

Conditionally on the X’s, the first term is a sum of mean zero random variables, and differs
from zero because of sampling variability in the $Y_i$ and the second term is conditionally
constant and differs from zero because the $X_i$’s are not exactly at $x$. It is natural to call
the first term a variance term and the second term a bias term, though strictly speaking
these labels refer to the second moment of the first term and the first moment of the
second term respectively. The decomposition considered here is of the form

$$\hat m(x) - m(x) = \sum W_i (Y_i^x - m(x)) + \sum W_i (Y_i - Y_i^x) - m(x) \left( 1 - \sum W_i \right). \qquad (7)$$

In this decomposition the first and second terms will still be referred to as variance and
bias terms, but the variance term in (7) is conditionally a weighted sum of i.i.d. mean
zero terms and moreover, the terms $Y_i^x - m(x)$ are independent of the $X_i$’s and hence
also of the $W_i$’s. This makes the variance term easier to handle, at the expense of some
complication in the bias term. However the bias term in (7) is tractable, and may be
conveniently analyzed via Vasserstein metrics. A similar decomposition of $\hat F_x$ will be used
in Chapter 3.
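
The decomposition into a variance term, a bias term, and a term for the total weight is an algebraic identity, so it can be checked numerically. The sketch below does this under assumptions made purely for illustration: an exponential conditional model with mean m(x) = 1 + x, a uniform design, and unnormalized triangular weights so that the total weight differs from one. All names are hypothetical.

```python
import math
import random

random.seed(1)
n, x0, h = 50, 2.0, 1.0


def F_inv(x, u):
    # Assumed model: given X = x, Y is exponential with mean m(x) = 1 + x.
    return -(1.0 + x) * math.log(1.0 - u)


def m(x):
    return 1.0 + x


X = [random.uniform(0.0, 4.0) for _ in range(n)]
U = [random.random() for _ in range(n)]
Y = [F_inv(xi, ui) for xi, ui in zip(X, U)]    # observed responses
Y_x = [F_inv(x0, ui) for ui in U]              # coupled responses at the target point

# Unnormalized triangular weights, so that sum(W) need not equal 1.
W = [max(0.0, 1.0 - abs(xi - x0) / h) / (n * h) for xi in X]

m_hat = sum(w * y for w, y in zip(W, Y))
variance_term = sum(w * (yx - m(x0)) for w, yx in zip(W, Y_x))
bias_term = sum(w * (y - yx) for w, y, yx in zip(W, Y, Y_x))
mass_term = -m(x0) * (1.0 - sum(W))

lhs = m_hat - m(x0)
rhs = variance_term + bias_term + mass_term
```

The two sides agree up to rounding error whatever the weights are; only the interpretation of the individual terms depends on the model.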

2.2  Examples  of  Weights 

This section presents some examples of weights that fit into the framework of Sec.
2.1.  Most  of  the  weight  schemes  discussed  here  were  developed  for  estimating  regression 
functions.  Similar  ideas  have  been  used  in  density  estimation  and  in  the  estimation  of 
spectral  densities.  The  discussion  covers  in  turn  the  following  methods:  kernels,  nearest 
neighbors,  symmetric  nearest  neighbors,  local  linear  regressions,  and  smoothing  splines. 
A  final  subsection  discusses  some  other  related  ideas.  For  a  comprehensive  bibliography 
of  nonparametric  regression  techniques  see  Collomb  (1985). 

2.2.1  Kernel  Smoothers 


Kernel estimates of the regression function were introduced by Nadaraya (1964) and
Watson (1964). For the kernel estimate:

$$W_i = \frac{K\left( \frac{X_i - x}{b_n} \right)}{\sum_{j=1}^{n} K\left( \frac{X_j - x}{b_n} \right)} \qquad (1)$$

where $K(v)$ is a function called the kernel and $b_n > 0$ is a constant called the bandwidth.
We assume that

$$\int |K(v)| \, dv < \infty$$

and

$$\int K(v) \, dv = 1.$$

The latter is a convenient normalization: multiplying $K$ by a (nonzero) constant would
not change $W_i$ and might make computation easier. Consistency of the kernel regression
estimate generally requires that $b_n \to 0$ at an appropriate rate.
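
A minimal Python sketch of kernel weights follows. It assumes the usual ratio form, in which each weight is the kernel evaluated at the scaled distance from the target point divided by the sum of such values over all observations. The triangular kernel, the function names, and the data are illustrative choices, not taken from the report.

```python
def triangular_kernel(v):
    """K(v) = (1 - |v|)_+ ."""
    return max(0.0, 1.0 - abs(v))


def kernel_weights(xs, x0, b, kernel=triangular_kernel):
    """Kernel weights at target x0 with bandwidth b, normalized to sum to 1."""
    k = [kernel((xi - x0) / b) for xi in xs]
    total = sum(k)
    if total == 0.0:
        raise ValueError("kernel is zero at every observation; increase the bandwidth")
    return [ki / total for ki in k]


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 2.0, 3.0, 5.0]

w = kernel_weights(xs, x0=1.0, b=1.0)
estimate = sum(wi * yi for wi, yi in zip(w, ys))   # kernel regression estimate at x0

# Rescaling the kernel by a nonzero constant leaves the weights unchanged.
w_scaled = kernel_weights(xs, x0=1.0, b=1.0,
                          kernel=lambda v: 3.0 * triangular_kernel(v))
```

The check with `w_scaled` mirrors the remark that the normalization of the kernel is a convenience: a constant factor cancels in the ratio.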


Kernel  regression  estimators  were  preceded  by  kernel  density  estimators.  Nadaraya 
(1964)  cites  Parzen  (1962)  and  Watson  (1964)  cites  Rosenblatt  (1956).  Kernel  methods 
were  previously  used  in  spectral  density  estimation.  This  connection  is  discussed  in  Subsec. 
2.2.3. 

We give some examples of kernel functions for $\mathcal{X} \subset \mathbb{R}$ taken from Benedetti (1977).
There are obvious extensions to $\mathbb{R}^d$.


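Examples:

1  Uniform      $K(v) = \frac{1}{2} 1_{|v| \le 1}$
2  Triangular   $K(v) = (1 - |v|)_+$
3  Quadratic    $K(v) = \frac{3}{4} (1 - v^2)_+$
4  Exponential  $K(v) = \frac{1}{2} e^{-|v|}$
5  Gaussian     $K(v) = (2\pi)^{-1/2} e^{-v^2/2}$
6  Cauchy       $K(v) = \frac{1}{\pi (1 + v^2)}$
7  Fejer        $K(v) =$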

The  quadratic  kernel  is  often  referred  to  as  the  Epanechnikov  kernel.  Epanechnikov 
(1969) argues that it is the optimal shape for estimating densities in any dimension so
long as the true density is sufficiently smooth. (It is optimal within the class of bounded
symmetric probability densities for which all moments are finite. An integrated squared
error  criterion  is  used.) 

Kernels  with  negative  sidelobes  (for  instance  the  Fejer  kernel)  are  used  to  reduce  bias. 
See  Watson  (1964)  for  an  example. 

2.2.2  Nearest  Neighbor  Smoothers 

Nearest  neighbor  methods  originated  with  Fix  and  Hodges  (1951)  in  the  context  of 
nonparametric  discrimination.  They  were  first  used  in  density  estimation  by  Loftsgaarden 
and  Quesenberry  (1965).  The  first  general  discussion  in  the  regression  context  seems  to 
be  Royall  (1966),  though  Watson  (1964)  mentions  uniform  nearest  neighbors. 

Let $c_{in}$, $1 \le i \le n < \infty$, be a triangular array of real numbers. If there are no ties
among the first $n$ $X$'s then the nearest neighbor weights are

$$W_i = W_{in} = c_{r(i)n} \qquad (2)$$

where $r(i)$ is the rank of $\|X_i - x\|$ among the first $n$ observations. If there are ties in the
$X$'s, break them arbitrarily, for example by using the index $i$, and assign weights from (2).
Then for each set $I$ of indices corresponding to tied $X$'s let

$$W_I = \frac{1}{|I|} \sum_{i \in I} W_i \qquad (3)$$

and set $W_i = W_I$, $\forall i \in I$.

Nearest neighbor weights are called $k$ nearest neighbor weights ($k$-NN) when for some
$k = k(n) = o(n)$, $i > k$ implies $c_{in} = 0$. The following are examples of $k$-NN weights.

Examples: 

1  Uniform      $c_{in} = \tfrac{1}{k}\, 1_{i \le k}$

2  Triangular   $c_{in} = 2(k - i + 1)_+ / (k(k + 1))$

3  Quadratic    $c_{in} = 6\left((k - i + 1)_+\right)^2 / (k(k + 1)(2k + 1))$
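The ranking, tie handling and $k$-NN shapes described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the shapes are normalized numerically so that the coefficients sum to one, ranking ties are broken by index, and weights are then equalized over observations with identical $X$ values.

```python
def knn_coefficients(k, n, shape="uniform"):
    """Coefficients c_in for i = 1..n, zero for i > k, summing to one."""
    power = {"uniform": 0, "triangular": 1, "quadratic": 2}[shape]
    raw = [float(k - i + 1) ** power if i <= k else 0.0
           for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def knn_weights(xs, x, k, shape="uniform"):
    """k-NN weights at target x for one-dimensional predictors xs."""
    n = len(xs)
    c = knn_coefficients(k, n, shape)
    # Rank observations by distance to x, breaking ties by index.
    order = sorted(range(n), key=lambda i: (abs(xs[i] - x), i))
    w = [0.0] * n
    for rank, i in enumerate(order):
        w[i] = c[rank]
    # Equalize the weights within each set of tied X values.
    groups = {}
    for i, xi in enumerate(xs):
        groups.setdefault(xi, []).append(i)
    for idx in groups.values():
        avg = sum(w[i] for i in idx) / len(idx)
        for i in idx:
            w[i] = avg
    return w
```

For example, with two observations tied at the nearest $X$ value and $k = 1$, each of the two receives weight one half.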

Nearest neighbor weights analogous to kernel functions with unbounded support may
also be of interest.

2.2.3  Symmetric and One-Sided Nearest Neighbors

When $\mathcal{X} = \mathbb{R}$, a family of symmetric nearest neighbor methods is available that
generalizes the familiar running average. At first assume there are no ties in the $X$'s, and
only consider target points that correspond to observations: $x = x_j$ for $j \le n$. Also assume
without loss of generality that the first $n$ observations are ordered $x_1 < x_2 < \dots < x_n$ to
avoid complicating the subscripts.

Let $c_{in}$, $0 \le i < n < \infty$, be a triangular array of weights. Then a symmetric nearest
neighbor scheme at the target point $x_j$ has

$$W_i \propto c_{|i-j|n}. \qquad (4)$$

The constant of proportionality in (4) is usually chosen so that the weights sum to 1.
Uniform, triangular, quadratic, etc. versions of symmetric nearest neighbor weights are
easily defined.

If $x$ is not an observation, but $x_j < x < x_{j+1}$, some convenient convention can be used
for the weights. Natural examples are $W_x = W_{x_j}$ and $W_x = \lambda W_{x_j} + (1 - \lambda) W_{x_{j+1}}$ where
$\lambda = (x_{j+1} - x) / (x_{j+1} - x_j)$. In practice it is likely to be even more convenient to simply
compute $T(\hat F_{x_i})$ for $i = 1, \dots, n$ and linearly interpolate values of $T$.

Ties can be broken as outlined above for nearest neighbor weights, although ties at the
target point are more troublesome. The following prescription for tie breaking generalizes
the one for nearest neighbors while preserving some symmetry between the right and left
sides. If there are an odd number $2j + 1$ of observations at $x$, then an arbitrary choice
can be made to assign them weights proportional to

$$\{c_{jn}, \dots, c_{1n}, c_{0n}, c_{1n}, \dots, c_{jn}\}$$

and to assign weight proportional to $c_{in}$ for $i > j$ to the $i$'th observations on each side of
the target. If there are an even number $2j$ of observations tied at $x$ then $2j - 1$ of them can
be assigned as above and one of them can get weight proportional to $c_{jn}$. The $i$'th point
to each side of the tied points then gets weight proportional to $(c_{(j+i-1)n} + c_{(j+i)n})/2$.
After such an assignment, the weights are equalized over sets of tied $x_i$'s as before.

A technique of Yang (1981) can be used to express the most commonly considered
symmetric nearest neighbor weights in terms of a symmetric kernel function $K(\cdot)$ and $F_n$,
the empirical distribution function of $X$:

$$W_i \propto K\left(\frac{F_n(X_i) - F_n(x)}{b_n}\right). \qquad (5)$$

The function $F_n$ is also defined when the $X_i$ are obtained from a design. The constant of
proportionality is chosen to make the weights sum to 1. (The exact form of (5) is from
Stute (1984).) For $x \in (x_j, x_{j+1})$ this formula implicitly sets $W_x = W_{x_j}$.

One advantage of symmetric nearest neighbor weights over kernel weights is that the
set of values $W_i$ is fixed in the former and random in the latter. The kernel method must
be modified to handle the case where the kernel function is zero for all the observations,
but this never happens with symmetric nearest neighbors. Such an event can have positive
probability for kernels with bounded support. The probability is generally small enough
to ignore in practice, but may pose difficulty in theoretical calculations. An advantage
of symmetric nearest neighbor weights over nearest neighbor weights is that they are
balanced with respect to the target point: except at the ends there is the same amount
of weight on the left as on the right of the target. With nearest neighbors the amount of
weight on a given side of $x$ is random and could be zero, even when $x$ is not at the ends of
the data. An advantage of symmetric nearest neighbor weights over both kernel weights
and nearest neighbor weights is that computation can be much faster. In the case of the
uniform weights, the weight function at $x_{j+1}$ differs from that at $x_j$ in at most two weights,
$W_{j+1+k}$ and $W_{j-k}$. The regression function can be computed quickly by updating a running
sum of $Y$'s and a count of the number of points. To compute the regression at all data
points requires only $2n$ additions, $2n$ subtractions and $n$ divisions. Triangular, quadratic
and higher order symmetric nearest neighbor regressions can be obtained by repeated
application of uniform symmetric nearest neighbors.
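The updating scheme for uniform symmetric nearest neighbors can be sketched in Python as below. The function name and the convention of truncating the window at the ends of the data are assumptions for this illustration; the responses are taken to be already ordered by their $x$ values, with no ties.

```python
def running_mean(ys, k):
    """Uniform symmetric nearest neighbor regression at every data point.

    ys are the responses ordered by x. The window at index j covers
    indices j-k..j+k, truncated at the ends of the data.
    """
    n = len(ys)
    out = []
    right = min(k, n - 1)
    total = sum(ys[:right + 1])   # running sum of Y's for the window at j = 0
    count = right + 1             # running count of points in the window
    for j in range(n):
        if j > 0:
            if j + k <= n - 1:    # one point enters on the right
                total += ys[j + k]
                count += 1
            if j - k - 1 >= 0:    # one point leaves on the left
                total -= ys[j - k - 1]
                count -= 1
        out.append(total / count)
    return out
```

Each step touches at most two observations, so the whole pass is linear in $n$ regardless of $k$.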


Now consider (5) with an asymmetric function $K(\cdot)$. An extreme departure from
symmetry involves taking in (5) kernels

$$R(v) = 2K(v)\, 1_{v > 0}$$

and

$$L(v) = 2K(v)\, 1_{v < 0}$$

which define right and left sided nearest neighbor weights. If $F_x$ is piecewise continuous
with a discontinuity near $x$, then the one-sided weights from the side opposite the
discontinuity may provide a better estimate than symmetric weights. A comparison of left and
right sided estimates of $F_x$ or $T(F_x)$ might provide a means of detecting discontinuities.
One-sided neighborhoods are used to estimate regressions in McDonald and Owen (1986).
Note that left sided weights are not available for the leftmost observation and are based
on few points for observations near the left end (and the same comments apply to right
sided weights at the right of the data).

The symmetric versions of uniform, triangular, and quadratic nearest neighbor weights
are related to the truncated, the modified Bartlett and one of the Parzen estimators of
spectral density respectively (Anderson 1971, Chapter 9). The relationship is as follows:
the estimate of the spectral density at frequency $\nu$ is a weighted sum of $c_r \cos(\nu r)$ for
$|r| \le k$, where $c_r$ is the sample autocovariance at lag $r$ and the weights are proportional to
$1_{|r| \le k}$, $(1 - |r|/k)\,1_{|r| \le k}$ and $(1 - (|r|/k)^2)\,1_{|r| \le k}$ respectively. Anderson also discusses several
other spectral density estimators, which could also be turned into $k$-NN weight functions.

In forecasting, one-sided exponential nearest neighbor weights are used in what is
called exponential smoothing (Chatfield 1980). In that application a time series is observed
at equally spaced points (so ranks correspond naturally to actual time elapsed) and the
weights are placed on the present and past to forecast the future. These weights have the
advantage of providing easily updatable regression functions. In the one-sided case, after
some startup, the regression estimate at $x_i$ is almost exactly $m(x_i) = pY_i + (1 - p)m(x_{i-1})$.
In the two-sided case the regression estimate is obtained by taking a weighted average of
the left and right sided exponential smooths.
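The one-sided recursion can be sketched as follows, together with the explicit weights it implies on the present and past observations. The start-up rule (the first smoothed value is set to the first observation) and the function names are assumptions chosen for illustration; other start-up conventions are possible.

```python
def exponential_smooth(ys, p):
    """One-sided exponential smoothing: m_i = p*Y_i + (1-p)*m_{i-1}."""
    m = []
    for i, y in enumerate(ys):
        if i == 0:
            m.append(y)                     # assumed start-up rule
        else:
            m.append(p * y + (1.0 - p) * m[-1])
    return m

def exponential_weights(i, p):
    """Weights on Y_0..Y_i implied by the recursion above at index i."""
    w = [p * (1.0 - p) ** (i - t) for t in range(i + 1)]
    w[0] = (1.0 - p) ** i                   # start-up term absorbs the remainder
    return w
```

The implied weights decay geometrically with the rank of the observation behind the target and sum to one, which is why the estimate is a one-sided nearest neighbor smoother on equally spaced data.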


2.2.4  Local  Linear  Regression  Weights 

An important class of weighting schemes are the linear regression weights. When $\mathcal{Y}$
is one dimensional the regression function at $x$ may be estimated by a linear regression on
the points in the neighbor set of $x$. The estimate of the regression is a linear combination
of the $Y$ values in the neighborhood, and the weights of the linear combination may be
thought of as generating an estimate $\hat F_x$ of $F_x$. When the $W_i$ are probability weights and $\mathcal{X}$
is $\mathbb{R}$, the weights obtained from a $W_i$-weighted regression of $Y$ on $X$ are

$$\tilde W_i = W_i \left(1 + \frac{(x - \bar x)(X_i - \bar x)}{s^2}\right) \qquad (6)$$

where $\bar x = \sum W_i X_i$ and $s^2 = \sum W_i (X_i - \bar x)^2$. (If $s = 0$ take $\tilde W_i = W_i$.) When the $W_i$ are
uniform ($1/k$ for $k$ points, 0 for the others) the $\tilde W_i$ resemble a kernel with a trapezoidal
shape, the height and slope of which depend on $\bar x$, $s$ and $k$. The $\tilde W_i$ sum to 1 but can
include some negative weights when $x$ is not near the mean of $W_x$, as must happen for $x$
near the ends of the data. For other shapes the linear regression "kernel" is the product
of the original weight function and a trapezoidal function that depends on the $X$'s and
the original weights through $\bar x$ and $s$.
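A short sketch of the linear regression weights of equation (6) follows. The function name is hypothetical, and the starting weights are assumed to be probability weights on one-dimensional predictors.

```python
def local_linear_weights(xs, w, x):
    """Linear regression weights at target x built from probability weights w."""
    xbar = sum(wi * xi for wi, xi in zip(w, xs))            # weighted mean of X
    s2 = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, xs))  # weighted variance
    if s2 == 0.0:
        return list(w)                                      # degenerate case
    return [wi * (1.0 + (x - xbar) * (xi - xbar) / s2)
            for wi, xi in zip(w, xs)]
```

Applying these weights to responses that are exactly linear in the predictor reproduces the line at the target, which is the sense in which linear structure is preserved. At a target far from the weighted mean of the predictors, some of the weights are negative.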

The motivation for linear regression weights is that they preserve linear structure in
the data. This is especially valuable at the ends of the observed sample where simple
weighted averages flatten out any trend. Friedman and Stuetzle (1983) use regressions
over symmetric uniform nearest neighbors for several different $k$ to arrive at an estimate
of the regression. See also Friedman (1984). McDonald and Owen (1986) use uniform
nearest neighbor linear regressions from several different $k$ values for left, right and
symmetric windows. Linear regression weights with uniform symmetric nearest neighbors are
updatable and hence can be computed in $O(n)$ operations assuming the data are sorted
on $x$.

Stone (1977) states that the local linear weights are not necessarily consistent and
shows how to "trim" them to achieve consistency. The trimming tends to remove their
utility at the ends of the data and in the middle of the data there is usually not much
difference between linear regression weights $\tilde W_i$ and the weights $W_i$ on which they are
based (at least in the symmetric uniform case). Marhoul and Owen (1985) study some of
the asymptotics of regression estimates based on linear regression weights on symmetric
and one-sided neighborhoods. The balance implicit in symmetric nearest neighbor sets is
exploited in their proof of the mean square consistency of running linear fits over such
sets; the proof would not go through for linear fits over ordinary nearest neighbor sets.
The mean square consistency holds for one-sided windows that contain $k - 1$ points from
one side of the target and 1 point that is either at the target or on the other side.

Stone (1977) gives the generalization of (6) for linear regression weights when $\mathcal{X} = \mathbb{R}^d$.

Linear  regressions  from  symmetric  and  one-sided  uniform  nearest  neighbor  weights 
are  updatable  and  linear  regression  versions  of  exponential  smoothing  are  also  updatable. 

2.2.5  Smoothing Splines

When $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, the smoothing spline estimator of the regression of $Y$ on $X$ is
that function $g(\cdot)$ that minimizes

$$\frac{1}{n} \sum_{i=1}^{n} (Y_i - g(X_i))^2 + \lambda \int g''(x)^2\,dx \qquad (7)$$

where $\lambda > 0$ is given. The solution $g(x)$ is a cubic spline with knots at the observations
by a variational argument of Reinsch (1967) and moreover can be written as a linear
combination (Wahba 1975) of $Y$'s

$$g(x) = \sum_{i=1}^{n} G(x, i) Y_i \qquad (8)$$

where for each $i$, $G$ provides a function on $\mathcal{X}$ and for each $x$, $G$ provides a vector of weights.
The smoothing spline fits into the framework of equation 2.1.4 by setting $W_i = G(x, i)$. In
principle this gives spline estimates of $F_x$, although the $G(x, i)$ are difficult to compute.
Silverman (1984) develops an asymptotic approximation to $G$ in terms of a variable
kernel:

$$G(x, i) \approx \frac{1}{n} \frac{1}{f(x_i)} \frac{1}{h(x_i)} \kappa\left(\frac{x - x_i}{h(x_i)}\right) \qquad (9)$$

where

$$h(x_i) = \lambda^{1/4} f(x_i)^{-1/4}$$

and

$$\kappa(v) = \tfrac{1}{2} \exp\left(-|v|/\sqrt{2}\right) \sin\left(|v|/\sqrt{2} + \pi/4\right)$$

and $f$ is the (well-behaved) density of $X$.
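The kernel appearing in Silverman's approximation can be examined numerically. The sketch below evaluates it and checks, with a simple trapezoid rule, that it integrates to one and takes negative values away from the origin. The function names, the integration range and the step count are assumptions made for this illustration.

```python
import math

def silverman_kernel(v):
    """Equivalent kernel of the cubic smoothing spline."""
    a = abs(v) / math.sqrt(2.0)
    return 0.5 * math.exp(-a) * math.sin(a + math.pi / 4.0)

def trapezoid(f, lo, hi, steps):
    """Composite trapezoid rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    inner = sum(f(lo + i * h) for i in range(1, steps))
    return h * (0.5 * f(lo) + inner + 0.5 * f(hi))

total_mass = trapezoid(silverman_kernel, -40.0, 40.0, 40000)
```

The negative side lobes mean that the implied spline weights form a signed measure rather than a probability measure.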

For  a  summary  of  spline  smoothing  see  Silverman  (1985)  and  Wegman  and  Wright 
(1983). 


2.2.6  Other  Weights 

A variation on kernel weights due to Priestley and Chao (1972) uses weights
proportional to

$$(x_i - x_{i-1})\, K\left(\frac{x_i - x}{b_n}\right) \qquad (10)$$

where the observations are arranged so that the sequence $(x_i)$ is nondecreasing. The
Priestley-Chao weights modify the Nadaraya-Watson weights so that closely spaced points
get relatively less weight and more widely separated points get relatively more weight.
Gasser and Müller (1977) show that the weights in (10) have a smaller asymptotic mean
square error than do ordinary kernel weights in the case of equidistant and nearly
equidistant designs.

The kriging technique, popular in geostatistics, estimates a regression function
(usually over two or three dimensions) by a weighted combination of observations, the weights
depending on proximity to the target point and upon an assumed covariance structure for
the observations. Therefore at least superficially it can be expressed via equation 2.1.4
and the weights used to estimate conditional distribution functions. For a discussion of
kriging see Ripley (1981) or Yakowitz and Szidarovszky (1985) who compare it to kernel
nonparametric regression. Watson (1984) shows that spline regression estimation can be
obtained as a special case of kriging.

The regression trees of Breiman et al. (1984) could be used to estimate $F_x$ by putting
equal  weight  on  all  the  observations  in  each  node.  That  estimate  would  be  used  for  all  the 
predictor  values  that  lead  to  the  node.  Since  the  splits  made  by  the  recursive  partitioning 
algorithm  depend  in  a  complicated  way  on  the  Y  values,  so  do  the  resulting  weights.  For 
this  reason  they  do  not  fit  into  the  framework  considered  here. 

Another smoothing technique that does not fit into the present context is the iterative
application of running medians in Tukey (1977). A single running median may be
interpreted as the conditional median function when uniform symmetric $k$-NN weights are used,
but iterative application of such an algorithm would be quite unnatural if not impossible
to interpret as a functional applied to an estimate of the conditional distribution.

Wandering schematic plots (Tukey, 1977) are in the spirit of this work, however. They
are formed by partitioning the $X$-axis into bins and computing sample statistics for the
$Y$ values that appear in each bin. The resulting values are plotted above the bin medians.

2.2.7  Bandwidth  Selection 

In all of the above weighting schemes there is a parameter $k$ or $b_n$ or $\lambda$ that governs
the rate at which the weight drops off as the distance from $X_i$ to $x$ increases. In each case
larger  values  of  the  parameter  result  in  more  spread  out  weights  and  the  corresponding 
regression  estimates  are  smoother  looking.  We  use  the  term  bandwidth  to  refer  to  any  of 
these  quantities.  Smaller  bandwidths  give  rise  to  regression  curves  that  pass  closer  to  the 
data.  In  general  a  regression  estimated  with  a  small  bandwidth  is  subject  to  less  bias  and 
more  variance  than  when  a  large  bandwidth  is  used.  The  bandwidth  to  be  used  can  be 
selected  by  plotting  the  results  for  a  few  choices  and  selecting  the  one  that  seems  best. 

For reasons expressed well in Silverman (1985, Sec. 4) it is desirable to have available
an automatic technique for bandwidth selection. The cross-validation method of Stone
(1974) is commonly used for this. The idea is to choose the bandwidth that minimizes
cross-validated squared error. See Friedman and Stuetzle (1983) who use cross-validation
to select $k$ for a linear regression over a uniform symmetric $k$-NN neighborhood, Hall
(1984)  who  studies  asymptotics  for  the  cross-validated  kernel  regression,  and  Wahba  and 
Wold  (1975)  for  cross-validation  in  smoothing  splines.  Craven  and  Wahba  (1979)  provide 
a  faster  alternative  to  cross-validation,  known  as  generalized  cross-validation.  Friedman 
and  Stuetzle  perform  a  local  cross-validation  so  that  the  bias-variance  tradeoff  implicit  in 
a  choice  of  k  can  be  made  for  each  x. 
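As an illustration of choosing a bandwidth by cross-validated squared error, the sketch below uses leave-one-out cross-validation for a kernel regression with a Gaussian kernel. The function names, the kernel, the leave-one-out form and the finite candidate set are assumptions for this example and are not taken from the cited papers.

```python
import math

def loo_estimate(xs, ys, i, b):
    """Kernel regression estimate at xs[i] with observation i left out."""
    num = den = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        if j == i:
            continue
        k = math.exp(-0.5 * ((xj - xs[i]) / b) ** 2)
        num += k * yj
        den += k
    return num / den if den > 0.0 else None

def cv_score(xs, ys, b):
    """Average leave-one-out squared prediction error for bandwidth b."""
    total = 0.0
    for i in range(len(xs)):
        fit = loo_estimate(xs, ys, i, b)
        if fit is None:          # all kernel values underflowed to zero
            return math.inf
        total += (ys[i] - fit) ** 2
    return total / len(xs)

def select_bandwidth(xs, ys, candidates):
    """Candidate bandwidth with the smallest cross-validated squared error."""
    return min(candidates, key=lambda b: cv_score(xs, ys, b))
```

Because the selected bandwidth depends on the $Y$ values, the resulting weights do too. Restricting the choice to a small finite set of candidates keeps that dependence simple.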

Titterington (1985) surveys smoothing techniques in statistics including image
processing and mentions some alternatives to cross-validation including minimum risk and
Bayesian methods. In minimum risk strategies, the minimizing bandwidth for a risk
function is obtained or approximated by a closed form expression. Typically such an
expression would involve the underlying curve and an approximation to that curve would be
substituted.

Bandwidth  selection  techniques  do  not  usually  fit  into  equation  2.1.4  since  the  Y 
values  are  used  to  select  the  bandwidth.  When  the  dependence  is  very  simple  however  as 
in  the  case  of  a  choice  of  bandwidth  from  a  finite  set  of  consistent  bandwidths  the  results 
of  Chapters  3  and  4  are  easy  to  apply. 

If a bandwidth choice is made and used to obtain $W_x$ and then all functionals of
interest are computed with the estimate $\hat F_x$ then many natural relationships between
different functionals will hold for the estimates. For example

$$\hat E(g(Y) + h(Y) \mid X = x) = \hat E(g(Y) \mid X = x) + \hat E(h(Y) \mid X = x)$$

and

$$\mathrm{Var}(\hat F_x) = \int (y - m(\hat F_x))^2\, \hat F_x(dy)$$

and

$$\mathrm{Var}(\hat F_x) = \hat E(Y^2 \mid X = x) - \hat E(Y \mid X = x)^2$$

will hold. For probability weights $W_x$ the estimated quantiles are properly ordered (in
particular quantile regressions will not "cross") and

$$\hat F_x\left(|Y - m(\hat F_x)| \ge k \sqrt{\mathrm{Var}(\hat F_x)}\right) \le 1/k^2,$$

so that a pointwise Chebychev's inequality will hold and so on. Such self consistency
properties of the estimates are desirable though they may entail some cost: the best
bandwidths, in squared error terms say, may differ from functional to functional. For
example one might do better with larger bandwidths for variances and extreme quantiles
than for means and moderate quantiles respectively. In practice it should often be
reasonable to pick the bandwidth to estimate a particular functional such as the mean and
then use those weights for all other functionals.
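The self consistency properties can be seen directly when every functional is computed from one set of weights. The sketch below is an illustration with hypothetical function names, assuming probability weights attached to the observed $Y$ values; the small tolerance in the quantile function is an assumed guard against floating point rounding.

```python
def weighted_mean(ys, w):
    """Mean of the distribution that puts weight w[i] on ys[i]."""
    return sum(wi * yi for wi, yi in zip(w, ys))

def weighted_variance(ys, w):
    """Variance of the weighted distribution about its own mean."""
    m = weighted_mean(ys, w)
    return sum(wi * (yi - m) ** 2 for wi, yi in zip(w, ys))

def weighted_quantile(ys, w, q):
    """Smallest observed y whose cumulative weight reaches q."""
    cum = 0.0
    pairs = sorted(zip(ys, w))
    for y, wi in pairs:
        cum += wi
        if cum >= q - 1e-12:
            return y
    return pairs[-1][0]
```

Since the mean, variance and quantiles all refer to the same estimated distribution, the variance equals the estimated second moment minus the squared mean, and quantiles at increasing levels are nondecreasing.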

2.3  Statistical  Functionals 

A  statistical  functional  is  a  mapping  defined  on  a  space  of  distribution  functions.  Usually 
the image space is $\mathbb{R}$ but it could also be a set of categories or a higher dimensional
Euclidean  space.  The  domain  usually  includes  all  empirical  distribution  functions  and  the 
hypothetical  true  distribution.  Statistical  functionals  are  a  convenient  abstraction;  they 
apply  in  most  statistical  situations  and  allow  the  use  of  concepts  and  techniques  from 
analysis. 

Many  quantities  of  interest  to  statisticians  can  be  expressed  as  statistical  functionals 
$T(F)$ where $F$ is the distribution of the data. The natural estimate of $T(F)$ is often $T(F_n)$
where  Fn  is  the  sample  distribution  function.  For  example,  the  sample  mean  is  m(Fn). 

Most  calculations  that  statisticians  perform  on  a  set  of  data  can  be  expressed  as 
statistical  functionals  on  Fn.  Any  function  of  n  i.i.d.  observations  is  a  function  of  a  list 
of  the  observed  values  (sorted  for  example)  and  a  permutation  that  labels  them.  Most 
statistical  computations  make  no  use  of  the  labelling  of  the  observations  (except  perhaps 
to  check  independence  or  identity  of  distribution)  and  hence  depend  only  on  the  list  of 
observations. The list of observations is determined by $F_n$ and $n$. The sample size $n$
cannot be determined from $F_n$. Statistical computations tend to depend more on $F_n$ than
on $n$. Many statistics do not depend on $n$ at all. For example the variance is

$$V(F) = \int (y - m(F))^2\, dF(y),$$

the median is

$$Q_{.5}(F) = \inf\left\{q : \int_{-\infty}^{q} dF(y) \ge .5\right\}$$

and an M-estimate $M(F)$ may be obtained as a solution $M$ of

$$0 = \int \psi(y - M)\, dF(y).$$

The most commonly cited statistic that depends on $n$ is the unbiased sample variance:

$$s^2 = \frac{n}{n-1} V(F_n).$$

In this and similar cases an auxiliary parameter may be introduced for the sample size.
The functional is then defined on $U \times [0, \infty]$ where $U$ is a space of distributions. The
sample value is $T(F_n, n)$ and the population value is $T(F, \infty)$. The analytic properties of
such sequences of functionals can be considered on this augmented space. For more on
auxiliary parameters see Reeds (1976, Sec. 1.6). In particular Reeds considers bivariate
Taylor series expansions of functionals whose first argument is a distribution and whose
second argument is an auxiliary parameter.
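The role of the auxiliary parameter can be illustrated with the variance. In the sketch below, whose function names are hypothetical, the variance functional depends only on the empirical distribution, while the augmented version takes the sample size as a second argument, with an infinite value standing for the population case.

```python
import math
import statistics

def variance_functional(ys):
    """V(F_n): variance of the empirical distribution of ys."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def augmented_variance(ys, n):
    """T(F_n, n): the variance functional rescaled by n / (n - 1).

    Passing n = math.inf gives the unscaled functional V.
    """
    scale = 1.0 if math.isinf(n) else n / (n - 1.0)
    return scale * variance_functional(ys)

sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
unbiased = augmented_variance(sample, len(sample))
plug_in = augmented_variance(sample, math.inf)
```

Here `unbiased` agrees with `statistics.variance(sample)` and `plug_in` with `statistics.pvariance(sample)` from the Python standard library.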

Many important properties of statistics may be expressed in terms of analytic
properties of statistical functionals. A statistical functional $T(F_n)$ is robust at $F$ according to
Hampel (1971) if $\mathcal{L}(T(F_n))$ as a function of the distribution $F$ of a single observation is a
continuous function at $F$ when the Prohorov metric is used in the spaces for both $F$ and
$\mathcal{L}(T)$. The augmented statistical functional $T(F_n, n)$ is robust if the continuity is uniform
in the auxiliary parameter. Hampel shows that if $T(\cdot)$ itself is continuous at $F$ then it is
robust at $F$. His definition of continuity of an augmented functional is essentially that of
bivariate continuity at $(F, \infty)$ although to avoid assuming the existence of $T(F, \infty)$ he uses
a version of the Cauchy criterion. It is important to note that robustness, like continuity,
depends on both the functional and the point in the domain under consideration. The
mean is not continuous at any $F$. The median is not continuous at $F$ if $F^{-1}\{1/2\}$ is an
interval of positive length, and hence is not robust at that $F$ either.

The  influence  curve  is  a  form  of  derivative  of  a  functional.  The  use  of  Taylor  expansions 
of  statistical  functionals  to  prove  asymptotic  normality  is  known  as  Von  Mises’  method. 
See  Sec.  2.6  for  a  discussion. 

If  one  can  obtain  results  based  only  on  analytic  properties  of  the  functionals  used  then 
they may apply easily to as yet unknown statistical methods. For example, in Chapter 3
some consistency results for running functionals require only Prohorov continuity of the
functional. They therefore apply to any robust functional.

Another advantage of functionals is that there is often a natural extension to spaces
that contain more than just distribution functions. The space of all finite signed measures
is such an extension as are $C[0, 1]$, $D[0, 1]$ and $L^p[0, 1]$. Such spaces are vector spaces and
hence  are  easier  to  work  with,  in  the  same  way  that  it  is  easier  to  work  in  Euclidean  space 
than  in  a  simplex.  The  functionals  for  the  mean,  median,  variance  and  the  M-estimators 
can  be  extended  meaningfully  to  larger  spaces.  Estimators  of  Fx  that  put  a  small  amount 
of  negative  weight  on  some  observations,  perhaps  to  reduce  bias,  can  be  handled  naturally 
by  extending  the  domain  of  the  functionals. 

2.4  Statistical  Metrics 

This section presents some of the more useful statistical metrics and discusses their
properties. A familiarity with metrics, norms, the topologies they induce and the associated
definitions of continuity and convergence is assumed. These concepts are readily found in
introductory books on topology, such as Willard (1970).

Throughout this section, $U$ is a space of distributions. They are defined as probability
measures on a measurable space $(\Omega, \mathcal{A})$, with the important special case $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is
the Borel $\sigma$-field. Sometimes it is convenient to extend $U$ to include finite signed measures
or to restrict to measures satisfying a moment condition. $F$, $G$, $H$, $F_n$ and $G_n$ are elements
of $U$. $F$ will be a bona fide probability and $F_n$ will denote the empirical probability from
a sample of size $n$ from $F$. $G$ and $H$ are general members of $U$ and $G_n$ is a sequence in $U$.
On $\mathbb{R}$ the letter used to denote the measure will also be used for the distribution function
so that for example $F(x) = F((-\infty, x])$.

If a statistical functional $T$ is continuous at $G$ when a metric $\rho$ is used on $U$ and
if $\rho(G_n, G) \to 0$ then $T(G_n) \to T(G)$. The same is true if both $\to$'s are replaced by
almost sure convergence or by weak convergence. (This is proved in Lemma 3.1.1. It is
not true for $L^2$ convergence.) Therefore consistency of $G_n$ for $G$ implies consistency for a
potentially large class of statistical functionals.

Recall that a metric $\rho_1$ on $U$ is stronger than $\rho_2$ (also on $U$) if every open $\rho_2$-ball
around a point in $U$ contains an open $\rho_1$-ball around the same point. A sequence that
converges in the $\rho_1$ metric converges in the $\rho_2$ metric. Any continuous function on the
metric space $(U, \rho_2)$ is continuous on $(U, \rho_1)$. Any continuous function with range $(U, \rho_1)$
is continuous with range $(U, \rho_2)$.

2.4.1  Prohorov  Metric 

Let $\Omega$ be a complete separable metric space with Borel sigma field $\mathcal{M}$ and metric $d$.
The most important case is $\Omega = \mathcal{Y} = \mathbb{R}$, $\mathcal{M} = \mathcal{B}$ and $d(x, y) = |x - y|$. For $\varepsilon > 0$ and
$A \subset \Omega$ define

$$A^{\varepsilon} = \{y \in \Omega : d(y, A) < \varepsilon\} \qquad (1)$$

where $d(y, A) = \inf_{z \in A} d(y, z)$. Let $G$ and $H$ be finite measures on $(\Omega, \mathcal{M})$ and define the
distance from $G$ to $H$:

$$\pi(G, H) = \inf\{\varepsilon > 0 : G(A) \le H(A^{\varepsilon}) + \varepsilon,\ \forall A \in \mathcal{M}\}. \qquad (2)$$

Now define

$$\mathrm{Proh}(G, H) = \max\{\pi(G, H), \pi(H, G)\}. \qquad (3)$$

This definition is the one given by Prohorov (1956) except that in (2) Prohorov uses only
closed sets $A$. The definitions are equivalent because for each Borel set $A$ and $\eta > 0$ there
is a closed set $B \subset A$ with $G(A \setminus B) < \eta$. Prohorov (1956) shows that the space of finite
measures on $(\Omega, \mathcal{M})$ with the distance function Proh is itself a complete separable metric
space and that $\mathrm{Proh}(G_n, G) \to 0$ iff $G_n \to G$ in the sense of weak convergence. That is

$$\mathrm{Proh}(G_n, G) \to 0$$

iff for every bounded continuous function $\varphi$ from $\Omega$ to $\mathbb{R}$

$$\int \varphi(y)\, dG_n(y) \to \int \varphi(y)\, dG(y).$$

The Prohorov metric is prominent in the robustness literature. It is usually defined
there on probability measures. For measures of equal total mass $\pi$ is a metric and metrizes
weak convergence. In particular $\pi$ is symmetric so $\mathrm{Proh} = \pi$ on probability measures.
See Huber (1981).

When two measures have almost the same mass $\pi$ is almost symmetric as the following
lemma shows.

Lemma 2.4.1. Let $G$ and $H$ be measures on $(\Omega, \mathcal{M})$ with $G(\Omega) \ge H(\Omega)$. Then
$\pi(H, G) \le \pi(G, H) \le \pi(H, G) + G(\Omega) - H(\Omega)$.

PROOF. Argue as Huber (1981, p.27) does in the special case of probability measures.
Let $\pi(H, G) = \varepsilon$ and let $\varepsilon' > \varepsilon$. Consider $A = B^{\varepsilon' c}$ in the definition of $\pi(H, G)$, where a
superscript $c$ denotes complementation. One obtains

$$H(\Omega) - H(B^{\varepsilon'}) \le G(\Omega) - G(B^{\varepsilon' c \varepsilon' c}) + \varepsilon'$$

so that

$$G(B^{\varepsilon' c \varepsilon' c}) \le H(B^{\varepsilon'}) + \varepsilon' + G(\Omega) - H(\Omega).$$

Because $B \subset B^{\varepsilon' c \varepsilon' c}$,

$$G(B) \le H(B^{\varepsilon'}) + \varepsilon' + G(\Omega) - H(\Omega).$$

Letting $\varepsilon' \downarrow \varepsilon$

$$G(B) \le H(B^{\varepsilon}) + \varepsilon + G(\Omega) - H(\Omega). \qquad (4)$$

From (4) $\pi(G, H) \le \pi(H, G) + G(\Omega) - H(\Omega)$. Equation (4) was derived without using
$G(\Omega) \ge H(\Omega)$ so it still holds when the roles of $G$ and $H$ are reversed. From this $\pi(H, G) \le$
$\pi(G, H)$ and the lemma is proved. ∎

Corollary. If $G_n(\Omega) \to G(\Omega)$ then the following are equivalent:

(i) $\pi(G_n, G) \to 0$

(ii) $\pi(G, G_n) \to 0$

(iii) $\mathrm{Proh}(G, G_n) \to 0$

PROOF. Immediate from the lemma and (3). ∎
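For small discrete measures on the real line, the one-sided distance in (2), the Prohorov distance in (3) and the bounds of Lemma 2.4.1 can be checked by brute force. The sketch below is an illustration under stated assumptions: measures are represented as dictionaries from points to nonnegative masses, only subsets of the support of the first measure are examined (enlarging a set beyond that support cannot increase its mass under the first measure), closed neighborhoods are used in the check (which does not change the infimum), and the infimum is located by bisection to a tolerance. The function names are hypothetical.

```python
from itertools import combinations

def pi_dist(G, H, tol=1e-9):
    """One-sided distance pi(G, H) for finite discrete measures on the line."""
    pts = list(G)
    subsets = [c for r in range(1, len(pts) + 1)
               for c in combinations(pts, r)]

    def ok(eps):
        # Check G(A) <= H(A^eps) + eps for every subset A of supp(G).
        for A in subsets:
            g_mass = sum(G[a] for a in A)
            h_mass = sum(m for y, m in H.items()
                         if min(abs(y - a) for a in A) <= eps)
            if g_mass > h_mass + eps + 1e-12:
                return False
        return True

    lo, hi = 0.0, sum(G.values())   # eps = G(Omega) always satisfies the check
    if ok(lo):
        return 0.0
    while hi - lo > tol:            # the check is monotone in eps
        mid = 0.5 * (lo + hi)
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

def prohorov(G, H, tol=1e-9):
    """Prohorov distance as the larger of the two one-sided distances."""
    return max(pi_dist(G, H, tol), pi_dist(H, G, tol))
```

For two unit point masses a distance 0.3 apart this returns approximately 0.3, and for unit point masses further than 1 apart it returns 1. The number of subsets grows exponentially, so the sketch is only practical for very small supports.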

For probability measures $G_n$ and $G$ Billingsley (1971) gives these equivalent
characterizations of weak convergence of $G_n$ to $G$:

a) $\limsup G_n(A) \le G(A)$ $\forall$ closed $A$

b) $\liminf G_n(A) \ge G(A)$ $\forall$ open $A$

c) $\lim G_n(A) = G(A)$ $\forall A$ with $G(\partial A) = 0$

For finite measures the above are all equivalent to $\mathrm{Proh}(G_n, G) \to 0$ (Prohorov 1956, Sec.
1.3) if the condition $\lim G_n(\Omega) = G(\Omega)$ is adjoined to a) and b).

Hampel  (1971)  uses  the  Prohorov  metric  to  define  robustness  of  a  statistical  functional. 
His  definition  is  that  the  map  from  the  distribution  of  the  data  to  the  distribution  of 
the  functional  is  continuous  (uniformly  in  n)  when  the  Prohorov  metric  is  used  on  both 
spaces. Hampel’s theorem for a statistical functional is that it is robust if and only if it is
a continuous mapping from the space of distributions to $\mathbb{R}$ where the Prohorov metric is
used on the space of distributions.

Any quantile is a Prohorov continuous functional at any distribution that has positive
mass in all open intervals about the quantile. An M-estimate with a bounded and strictly
monotone $\psi$ function is Prohorov continuous at every distribution. The functional

$$T_y(F) = F(y)$$

for fixed $y$ is Prohorov continuous at every $F$ for which $y$ is a continuity point. The
$\alpha$-trimmed mean with $0 < \alpha < 1/2$ is Prohorov continuous at every distribution. More
generally an L-estimate

$$T(F) = \int_0^1 F^{-1}(u)\, M(du)$$

where $M$ is a finite signed measure with support in $[\alpha, 1 - \alpha]$ is Prohorov continuous at
any $F$ for which no discontinuity point of $F^{-1}$ is a point of mass of $M$.

Many important functionals are not Prohorov continuous. That is to say they are
not robust. In particular the mean is not continuous at any distribution function. Higher
moments and related quantities such as the standard deviation, correlation and coefficient
of variation are not continuous anywhere. Similarly $F(y) - F(y-)$, the jump of $F$ at $y$, is
not Prohorov continuous for any $F$ with an atom at $y$.

The mean can be made continuous by considering a smaller space $U$. For example,
on a space of distributions with uniformly bounded support, all moments are Prohorov
continuous. If for $1 \le p < q$ the distributions in $U$ have a uniformly bounded $q$’th moment,
then the $p$’th moment is Prohorov continuous (Chung, 1974, Theorem 4.5.2).

In Chapter 3, one of the conditions used is that $F_\cdot$ as a map from $\mathcal{X}$ to $U$ is Prohorov
continuous. In other words as $x' \to x$ the distribution of $Y$ given $X = x'$ converges weakly
to the distribution of $Y$ given $X = x$. This sort of continuity assumption would seem to
be very mild in practice.

In order to study weight sequences with some negative weights it would be useful to
have a metric for weak convergence of finite signed measures. Unfortunately no metrization
of weak convergence exists for signed measures, except in trivial cases. See Choquet (1969,
Vol. I, Sec. 12 and Theorem 16.9). (It is possible to metrize weak convergence of signed
measures on some spaces without compact sets of infinite cardinality.)

Recall that a finite signed measure $G$ can be written $G = G^+ - G^-$ where $G^+$ and $G^-$
are mutually singular measures called, respectively, the positive and negative parts of $G$.
The measure $|G| = G^+ + G^-$ is the total variation of $G$. (This is the Jordan decomposition,
and it is unique.)

The quantity Proh defined by (3) is peculiar on finite signed measures. It is almost
a metric, but sometimes $\mathrm{Proh}(G,G) > 0$. Furthermore the triangle inequality might not
hold if the space $\Omega$ is ill-behaved. (The triangle inequality holds if $(B^a)^b = B^{a+b}$ for all
$B \in \mathcal{B}$ and $a, b > 0$.) Convergence of (3) need not imply weak convergence:

Example 1. Let $G = 0$ and $G_n = n^2 \delta_{1/n} - n^2 \delta_{-1/n}$. Then

$$\max\{\pi(G_n, G), \pi(G, G_n)\} = 2/n \to 0$$

but $G_n$ does not converge weakly to $G$. Consider $\varphi(x) = 1 \wedge (x + 1)^+$. Then $\int \varphi(x)\,dG_n(x) = n$
and $\int \varphi(x)\,dG(x) = 0$.
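The value of the first integral can be checked by expanding the definitions in the example. The lines below use only that $\varphi(x) = 1$ for $x \ge 0$ and $\varphi(x) = 1 + x$ for $-1 \le x \le 0$:

```latex
\int \varphi \, dG_n
  = n^2 \varphi(1/n) - n^2 \varphi(-1/n)
  = n^2 \cdot 1 - n^2 \left(1 - \tfrac{1}{n}\right)
  = n \to \infty ,
\qquad
\int \varphi \, dG = 0 .
```

So the integrals of a bounded continuous function diverge even though the quantity in (3) tends to zero.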

Convergence of (3) combined with

$$\limsup |G_n| \le B < \infty$$

can be shown to imply weak convergence: first establish convergence for the signed
measures of closed sets and then extend to bounded continuous functions as in Pollard (1984,
Exercise IV-12).

We can define a metric that is stronger than weak convergence. For finite signed
measures $G$ and $H$ on $(\Omega, \mathcal{B})$ define

$$\mathrm{Proh}(G,H) = \mathrm{Proh}(G^+, H^+) + \mathrm{Proh}(G^-, H^-). \tag{5}$$

Proh as defined by (5) is still a metric and $\mathrm{Proh}(G_n, G) \to 0$ implies weak convergence of
$G_n$ to $G$.

It is possible for $G_n$ to converge weakly to $G$ without $\mathrm{Proh}(G_n, G)$ (as defined in (5))
converging to zero.

Example 2. Let $G = \delta_0$ and $G_n = 2\delta_0 - \delta_{1/n}$. Then $G_n$ converges weakly to $G$ but

$$\mathrm{Proh}(G_n, G) = 2.$$

2.4.2 Kolmogorov-Smirnov Metric

The Kolmogorov-Smirnov metric for distributions on $\mathbb{R}$ is

$$KS(G,H) = \sup_y |G(y) - H(y)|, \tag{6}$$

the sup norm of $G - H$. It takes its name from the Kolmogorov-Smirnov statistic
$KS(F_n, F)$. The space $U$ can be any set of functions on $\mathbb{R}$. This makes it a convenient
metric to use when considering distribution functions corresponding to finite signed
measures.

The Glivenko-Cantelli theorem states that $KS(F, F_n) \to 0$ a.s. as $n \to \infty$. In Chapter
3 sufficient conditions are given for $KS(\hat F_x, F_x) \to 0$ a.s.

The metric KS is stronger than Proh. That is

$$KS(G_n, G) \to 0 \;\Rightarrow\; \mathrm{Proh}(G_n, G) \to 0,$$

and there are sequences for which $\mathrm{Proh}(G_n, G) \to 0$ but $KS(G_n, G)$ does not converge
to 0. If $KS(G_n, G) \to 0$ the distribution functions $G_n$ are converging uniformly to $G$,
whereas if $\mathrm{Proh}(G_n, G) \to 0$ the convergence is pointwise at continuity points of $G$. If $G_n$
is a point-mass at $1/n$ and $G$ is a point-mass at 0, Prohorov but not Kolmogorov-Smirnov
convergence takes place.

All the functionals that are continuous under the Prohorov metric are continuous
under the Kolmogorov-Smirnov metric. Under this stronger metric, the jump functional

$$J_y(F) = F(y) - F(y-)$$

is continuous everywhere. The mean is nowhere continuous.

Suppose that the map $F_\cdot$ from $\mathcal{X}$ to $U$ is KS continuous. Then the function in the
$xy$-plane given by $F_x(y)$ is continuous (uniformly in $y$) when traversed parallel to the $x$
axis, but need not be continuous at all when traversed parallel to the $y$ axis.

Example 1. If $\lambda(x) > 0$ is a continuous function and $F_x$ is the Poisson distribution with
parameter $\lambda(x)$ then $F_\cdot$ is KS continuous. If $Y/\lambda(x)$ has the Poisson distribution with
parameter 1 then $F_\cdot$ is not KS continuous unless $\lambda$ is constant.

A KS continuous $F_\cdot$ can have atoms of fixed location in $y$ whose size varies continuously
with $x$ but cannot have atoms of fixed size whose locations vary continuously.

The weight function $W_x$ will not usually converge in the KS metric to $\delta_x$. For a
symmetric kernel and i.i.d. $X_i$ from a distribution without an atom at $x$, $KS(W_x, \delta_x) = .5$
except for end effects.
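The claim that the KS distance from the weights to a point mass stays near .5 can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: the triangular kernel, the uniform sample, the seed, and the helper name `ks_weights_to_point` are all hypothetical choices, not taken from the report, and the weights are assumed to be nonnegative and to sum to one.

```python
import random


def ks_weights_to_point(xs, ws, x):
    """KS distance between a probability weight measure and a point mass at x.

    Below x the point mass has d.f. 0, so the gap is the weight strictly to
    the left of x; from x on it has d.f. 1, so the gap is largest at x itself.
    """
    left = sum(w for xi, w in zip(xs, ws) if xi < x)
    up_to_x = sum(w for xi, w in zip(xs, ws) if xi <= x)
    return max(left, 1.0 - up_to_x)


rng = random.Random(1)                      # arbitrary fixed seed
xs = [rng.random() for _ in range(2000)]    # hypothetical i.i.d. uniform X_i
x = 0.5
ks_values = {}
for h in (0.2, 0.1, 0.05):
    # Triangular (symmetric) kernel weights with bandwidth h, normalized to sum to 1.
    raw = [max(0.0, 1.0 - abs(xi - x) / h) for xi in xs]
    total = sum(raw)
    ws = [r / total for r in raw]
    ks_values[h] = ks_weights_to_point(xs, ws, x)
    print(h, ks_values[h])
```

Shrinking the bandwidth concentrates the weights around `x` but does not move the KS distance toward zero, since about half of the weight always lies on each side of `x`.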

2.4.3  Vasserstein  Metrics 

These metrics are described in Bickel and Freedman (1981, Section 8). This section
is based on their account. Let $B$ be a separable Banach space with norm $\|\cdot\|$. (This is the
space $\mathcal{Y}$, which the reader might assume is $\mathbb{R}$.) For $1 \le p < \infty$ let $U = U_p$ be the space of
probability measures $F$ on the Borel $\sigma$-field of $B$ for which $\int \|y\|^p F(dy) < \infty$. Then the
Vasserstein metric $V_p(F,G)$ is the infimum of $\mathcal{E}(\|X - Y\|^p)^{1/p}$ over all pairs of random variables
$X$ and $Y$ with $X \sim F$ and $Y \sim G$. Bickel and Freedman’s Lemma 8.1 establishes that $V_p$
is a metric and that the infimum is attained.

The Vasserstein metric provides a way of metrizing $L^p$ convergence. Bickel and
Freedman’s Lemma 8.3 states that $V_p(G_n, G) \to 0$ if and only if

$$G_n \to G \text{ weakly, and } \int \|y\|^p\, G_n(dy) \to \int \|y\|^p\, G(dy).$$

Clearly $V_p$ convergence implies Prohorov convergence. In fact $\mathrm{Proh}(F,G) \le \sqrt{V_1(F,G)}$, a
result due to Dobrushin (1970). Also for distributions in $U_p$ where $p > p' \ge 1$, $V_{p'}(F,G) \le
V_p(F,G)$.

For $B = \mathbb{R}$ with norm $|\cdot|$ there is a convenient formula for $V_p$ due to Major (1978):

$$V_p(F,G) = \left( \int_0^1 |F^{-1}(u) - G^{-1}(u)|^p\, du \right)^{1/p} \tag{7}$$

so that $V_p$ is a “sideways $L^p$” metric. In particular $V_1(F,G)$ is the area between the d.f.s
$F$ and $G$ and hence may also be written:

$$V_1(F,G) = \int_{-\infty}^{\infty} |F(y) - G(y)|\, dy.$$
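As a rough illustration of formula (7), the following Python sketch computes $V_p$ between two empirical distributions with equally many atoms, where both quantile functions are step functions and the integral reduces to an average over matched order statistics. It also computes $V_1$ as the area between the two empirical d.f.s. The function names and the sample values are illustrative assumptions, not taken from the report.

```python
def vasserstein_p(xs, ys, p=1.0):
    """V_p between two empirical distributions with equally many atoms.

    With n equally weighted atoms each, both quantile functions are constant
    on the intervals ((i-1)/n, i/n), so the integral in (7) is an average
    over matched order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(xs), sorted(ys)
    n = len(xs)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / n) ** (1.0 / p)


def area_between_dfs(xs, ys):
    """Integral of |F(y) - G(y)| dy for two empirical d.f.s."""
    pts = sorted(set(xs) | set(ys))
    area = 0.0
    for left, right in zip(pts, pts[1:]):
        # Both d.f.s are constant on [left, right), so evaluate them at left.
        f = sum(x <= left for x in xs) / len(xs)
        g = sum(y <= left for y in ys) / len(ys)
        area += abs(f - g) * (right - left)
    return area


sample_f = [0.0, 1.0, 4.0, 6.0]   # hypothetical data
sample_g = [0.5, 3.0, 3.5, 9.0]   # hypothetical data
v1 = vasserstein_p(sample_f, sample_g, p=1)
v2 = vasserstein_p(sample_f, sample_g, p=2)
area = area_between_dfs(sample_f, sample_g)
print(v1, area, v2)
```

For these samples the quantile form and the area form of $V_1$ agree, and $V_1 \le V_2$, in line with the ordering of the $V_p$ noted above.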


The metric is also known as the Mallows metric. Mallows (1972) used the form (7)
and established that convergence in the Mallows metric is equivalent to combined weak
and $L^2$ convergence.

It is natural to adjoin a metric based on essential suprema. Define

$$\operatorname{ess\,sup} F = \sup\{B > 0 : F\{\|Y\| > B\} > 0\}$$

and let $U_\infty$ be the set of probability measures with finite essential suprema. Then define
for $F, G \in U_\infty$

$$V_\infty(F,G) = \inf \operatorname{ess\,sup} \|X - Y\|$$

where the infimum is taken over pairs $X \sim F$ and $Y \sim G$. In the case of $B = \mathbb{R}$,

$$V_\infty(F,G) = \sup_{0<u<1} |F^{-1}(u) - G^{-1}(u)|. \tag{8}$$

It is clearly a metric since it is the sup norm of $F^{-1} - G^{-1}$. Also $V_\infty$ convergence implies
$V_p$ convergence for all finite $p \ge 1$. The form (8) will be used to define a metric on
the set of all probability distribution functions, not just those with bounded support. The
resulting metric may take infinite values.

Convergence of $V_\infty(G_n, G)$ to 0 implies that $\mathrm{Proh}(G_n, G) \to 0$ and $\operatorname{ess\,sup} G_n \to
\operatorname{ess\,sup} G$. The converse does not hold as the next example illustrates.

Example 1. Let $G_n$ be uniform on the set $[0, 1 + 1/n] \cup [2 + 1/n, 3]$ and $G$ be uniform on
$[0, 1] \cup [2, 3]$. Then $G_n \to G$ weakly and the essential suprema converge but $V_\infty(G_n, G) = 1$.

KS convergence and $V_p$ convergence (for $B = \mathbb{R}$) are not comparable. (One is tempted
to say they are orthogonal.) KS convergence does not imply $V_1$ convergence and $V_\infty$
convergence does not imply KS convergence.

Example 2. Take $B = \mathbb{R}$, $G = \delta_0$ and $G_n = (1 - 1/n)\delta_0 + (1/n)\delta_n$. Then

$$KS(G_n, G) = 1/n \to 0$$

but $V_1(G_n, G) = 1$.

Example 3. Take $B = \mathbb{R}$, $G = \delta_0$ and $G_n = \delta_{1/n}$. Then

$$V_\infty(G_n, G) = 1/n \to 0$$

but $KS(G_n, G) = 1$.
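Examples 2 and 3 can be checked exactly for a few values of $n$. The Python sketch below represents a discrete distribution as a list of (atom, mass) pairs with rational masses; the representation and all function names are assumptions made for this illustration only.

```python
from fractions import Fraction


def cdf(dist, y):
    """Distribution function at y of a discrete distribution [(atom, mass), ...]."""
    return sum(m for a, m in dist if a <= y)


def ks(g, h):
    """Sup-norm distance; for right-continuous step functions it is attained at an atom."""
    atoms = {a for a, _ in g} | {a for a, _ in h}
    return max(abs(cdf(g, y) - cdf(h, y)) for y in atoms)


def v1(g, h):
    """Area between the two distribution functions."""
    atoms = sorted({a for a, _ in g} | {a for a, _ in h})
    return sum(abs(cdf(g, lo) - cdf(h, lo)) * (hi - lo)
               for lo, hi in zip(atoms, atoms[1:]))


def quantile(dist, u):
    """Left-continuous inverse: smallest atom a with cdf(a) >= u."""
    return min(a for a, _ in dist if cdf(dist, a) >= u)


def v_inf(g, h):
    """Sup over u in (0, 1) of |G^{-1}(u) - H^{-1}(u)|, as in (8).

    Both quantile functions are constant between consecutive cumulative
    masses, so one evaluation per such interval (at its midpoint) suffices.
    """
    cuts = sorted({Fraction(0), Fraction(1)}
                  | {cdf(g, a) for a, _ in g} | {cdf(h, a) for a, _ in h})
    mids = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]
    return max(abs(quantile(g, u) - quantile(h, u)) for u in mids)


point0 = [(Fraction(0), Fraction(1))]               # G = delta_0
results = {}
for n in (2, 10, 100):
    ex2 = [(Fraction(0), 1 - Fraction(1, n)), (Fraction(n), Fraction(1, n))]
    ex3 = [(Fraction(1, n), Fraction(1))]
    results[n] = (ks(ex2, point0), v1(ex2, point0),
                  v_inf(ex3, point0), ks(ex3, point0))
    print(n, results[n])
```

In Example 2 the KS distance shrinks like $1/n$ while $V_1$ stays at 1; in Example 3 the roles are reversed, with $V_\infty = 1/n$ and KS equal to 1.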

Vasserstein metrics are useful in describing the distance of $W_x$ from $\delta_x$, when the $W_i$
are probability weights. For example $V_1(W_x, \delta_x) = \sum W_i \|X_i - x\|$, the weighted average
distance of the observations from the target point. Similarly $V_\infty(W_x, \delta_x)$ is the greatest
distance from $x$ of any point used in $\hat F_x$. When a nonnegative kernel has bounded support
and the bandwidth tends to zero, the resulting vector of weights converges in $V_\infty$ to $\delta_x$.
The same is generally true of nearest neighbor schemes in which all but a vanishingly
small proportion of the observations are given 0 weight.
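The two distances just described are simple to compute for a concrete weight scheme. The sketch below uses an Epanechnikov-type kernel with support $[-1, 1]$ on a fixed grid of design points; the kernel choice, the grid, the bandwidths, and the function names are assumptions made for illustration.

```python
def epanechnikov_weights(xs, x, h):
    """Probability weights W_i proportional to K((X_i - x)/h), K supported on [-1, 1]."""
    raw = [max(0.0, 1.0 - ((xi - x) / h) ** 2) for xi in xs]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no observation within bandwidth h of x")
    return [r / total for r in raw]


def v1_to_point(xs, ws, x):
    """V_1(W_x, delta_x): weighted average distance of the X_i from x."""
    return sum(w * abs(xi - x) for xi, w in zip(xs, ws))


def vinf_to_point(xs, ws, x):
    """V_inf(W_x, delta_x): greatest distance from x among points with nonzero weight."""
    return max(abs(xi - x) for xi, w in zip(xs, ws) if w != 0.0)


xs = [i / 50 for i in range(51)]   # hypothetical design points on [0, 1]
x = 0.5
distances = {}
for h in (0.4, 0.2, 0.1, 0.05):
    ws = epanechnikov_weights(xs, x, h)
    distances[h] = (v1_to_point(xs, ws, x), vinf_to_point(xs, ws, x))
    print(h, distances[h])
```

Because the kernel vanishes outside $[-1, 1]$, $V_\infty(W_x, \delta_x)$ cannot exceed the bandwidth, so it tends to zero with the bandwidth, and $V_1$ is never larger than $V_\infty$.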

When $W_x$ is not a probability, it is still convenient to use the Vasserstein distance as
a shorthand notation for the distance between $W_x$ and $\delta_x$. Therefore for $1 \le p < \infty$ define

$$V_p(W_x, \delta_x) = \cdots\, \|X_i - \cdots$$

and

$$V_\infty(W_x, \delta_x) = \sup_{W_i \ne 0} \|X_i - x\|.$$

The Vasserstein metrics are also useful in manipulating the quantity $|Y_i - Y_i^x|$, the
difference between $Y_i$ and “the value it would have taken had $X_i$ been $x$”. To wit:

$$\mathcal{E}\left( |Y_i - Y_i^x|^p \mid X_i = x_i \right) = \int_0^1 |F_{x_i}^{-1}(u) - F_x^{-1}(u)|^p\, du = V_p(F_x, F_{x_i})^p.$$

Therefore if, as is reasonable, $x_i$ close to $x$ implies $V_p(F_x, F_{x_i})$ is small, the bias due to
using an observation from $X = x_i$ instead of $x$ should be small.

The following lemma from Bickel and Freedman is of interest:

Lemma 2.4.2. Let the $Y_i$ be independent; likewise for the $Z_i$; assume their distributions are in
$U_p$, $1 \le p < \infty$. Then

$$V_p\left( \mathcal{L}\left(\sum_{i=1}^n Y_i\right), \mathcal{L}\left(\sum_{i=1}^n Z_i\right) \right) \le \sum_{i=1}^n V_p(\mathcal{L}(Y_i), \mathcal{L}(Z_i)).$$

PROOF. Bickel and Freedman (1981, Lemma 8.6).

When $B$ is a Hilbert space, Bickel and Freedman (1981) obtain some sharper results
for the Mallows metric $V_2$.

2.4.4  Other  Metrics 

The  three  metrics  considered  above  are  the  ones  that  will  be  used  in  Chapters  3  and 
4.  This  section  rounds  out  the  discussion  of  statistical  metrics  with  some  other  metrics  in 
common  usage. 

The Lévy metric for distributions on the real line is

$$\mathrm{Levy}(F,G) = \inf\{\epsilon > 0 : F(x - \epsilon) - \epsilon \le G(x) \le F(x + \epsilon) + \epsilon \ \ \forall x\}.$$

This metric also metrizes weak convergence. It has a geometric interpretation as $1/\sqrt{2}$
times the maximum distance between the distribution functions taken in the northwest to
southeast direction. On the space of subprobability measures $G_n$ converges to $G$ in the
Lévy metric if and only if $G_n$ converges weakly to $G$ and the total mass $G_n(\mathbb{R})$ converges
to $G(\mathbb{R})$ (Chung, 1974, p. 94).

The bounded Lipschitz metric (Huber, 1980) also metrizes weak convergence on
complete separable metric spaces. Assume the metric is bounded by 1. If necessary replace
the metric $d(\cdot,\cdot)$ by the topologically equivalent $d(\cdot,\cdot)/(1 + d(\cdot,\cdot))$. Then the bounded
Lipschitz metric is

$$\mathrm{BLip}(F,G) = \sup \left| \int \phi(y)\,dF(y) - \int \phi(y)\,dG(y) \right|$$

where the supremum is taken over functions $\phi$ that satisfy $|\phi(y_1) - \phi(y_0)| \le d(y_1, y_0)$.
Huber (1980, Ch. 2) shows that

$$\mathrm{Proh}(F,G)^2 \le \mathrm{BLip}(F,G) \le 2\,\mathrm{Proh}(F,G).$$

The KS metric can be generalized. Rewriting it as

$$KS(F,G) = \sup_y |F(-\infty, y] - G(-\infty, y]|$$

suggests generalizations in which the supremum is taken over different classes of sets.
Taking the supremum over all measurable sets yields the total variation metric:

$$TV(F,G) = \sup_{A \in \mathcal{B}} |F(A) - G(A)|$$

a very strong metric. This metric is so strong that $F_n$ does not converge to $F$ in total
variation when $F$ has a continuous part. On the other hand it is not strong enough
to force $V_1$ convergence (see Example 2.4.2). There are many ways to extend the KS
metric  to  higher  dimensional  spaces.  In  finite  dimensional  Euclidean  spaces  the  most 
straightforward  is  to  take  suprema  over  lower  left  orthants.  Suprema  over  half  spaces  or 
closed  balls  may  also  be  considered.  For  convergence  of  Fn  to  F  to  hold  for  all  F  in  one  of 
these  metrics  requires  that  the  class  of  sets  over  which  the  supremum  is  taken  not  be  too 
rich.  A  further  generalization  is  to  extend  suprema  over  probabilities  of  sets  to  suprema 
over  expectations  of  functions.  For  a  discussion  see  Pollard  (1984,  Ch.  2). 

Bickel and Freedman (1981) show that $V_p(F,G) = \epsilon$ if and only if there exist random
variables $X \sim F$ and $Y \sim G$ such that

$$\mathcal{E}(\|X - Y\|^p)^{1/p} = \epsilon.$$

Similar coupling results hold for some other metrics: $\mathrm{Proh}(F,G) \le \epsilon$ iff some such $X$ and
$Y$ satisfy

$$P(d(X,Y) \le \epsilon) \ge 1 - \epsilon$$

where $d$ is the metric on $\Omega$, $\mathrm{BLip}(F,G) \le \epsilon$ iff some such $X$ and $Y$ satisfy

$$\mathcal{E}(d(X,Y)) \le \epsilon$$

where $d$ is the bounded metric used to define BLip, and finally $TV(F,G) \le \epsilon$ iff some such
$X$ and $Y$ satisfy

$$P(X \ne Y) \le \epsilon.$$

The first and third of these follow from Strassen’s theorem (Huber, 1980) and the second
from Huber’s (1980) generalization of a theorem of Kantorovic and Rubinstein.

2.5 Models for $F_\cdot$

As indicated in Sec. 2.1 the X’s are obtained either by sampling or by design, and then the
Y’s are conditionally independent with the corresponding distributions. Given $X_i = x_i$ the
distribution of $Y_i$ is $F_{x_i}$. All the results in Chapters 3 and 4 are obtained after imposing
some structure (or model) on the set of $F_x$’s.

A very weak model is that $F_\cdot$ is Prohorov continuous. That is

$$x_i \to x \;\Rightarrow\; \mathrm{Proh}(F_{x_i}, F_x) \to 0,$$

so $Y_i \sim F_{x_i}$ converges to $Y \sim F_x$ in distribution. This is a very reasonable assumption for
many applications. It says that values of $x_i$ close to $x$ tend to have $Y$ distributions close to
the one at $x$. Absent such an assumption, one would hardly use smoothing techniques. Not
much is changed by assuming piecewise Prohorov continuity. For pointwise considerations
all that is needed is that $F_\cdot$ is Prohorov continuous at $x$.

A stronger model is that $F_\cdot$ is a location-scale family with location $\mu(x)$ and scale
$\sigma(x) \ge 0$. That is

$$F_x^{-1}(u) = \mu(x) + \sigma(x)\, G^{-1}(u) \tag{1}$$

for some distribution function $G(u)$. $G$ may be normalized to have location 0 and scale
1, for some location and scale functionals. The model (1) is still fairly general and will
be used below to give conditions on $F_\cdot$ a more concrete appearance. When $\sigma(x) > 0$ the
location-scale model may also be written

$$F_x(y) = G\left( \frac{y - \mu(x)}{\sigma(x)} \right).$$

It is interesting to note that explicit continuity assumptions need not be made when
estimating conditional moments. Stone (1977) assumes that $(X,Y)$ has a distribution for
which $\mathcal{E}(|Y|^r) < \infty$ for some $r \ge 1$ and obtains global $L^r$ consistency for the regression
function. Stone (1977, p. 641) explains that continuity assumptions are not needed
because the regression function, as a function in $L^r$, can be approximated in $L^r$ norm by
a continuous function with bounded support to within any $\epsilon > 0$. Devroye (1981) obtains
pointwise strong and weak consistency using the moment condition on $Y$.

Neither Prohorov continuity of $F_\cdot$ nor the existence of a moment of $Y$ is empirically
verifiable. Both seem to be mild assumptions.

The main benefit of the continuity assumption on the conditional distributions is that
it becomes easier to handle non-random X’s. The same theorems will cover the random
and the design case. A second, minor benefit is that it is possible to consistently estimate a
conditional expectation in some cases where $\mathcal{E}(|Y|)$ does not exist. As a trivial example
suppose that the $X_i$ are independent standard Cauchy random variables and that $Y_i = X_i$.
Then a uniform nearest neighbor scheme with $k = \sqrt{n}$ provides pointwise consistent
estimates of the regression. (We could even have added some well-behaved noise.)
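The Cauchy example can be simulated directly. The following Python sketch draws standard Cauchy $X_i$, sets $Y_i = X_i$, and averages the $Y$ values at the $k = \sqrt{n}$ nearest neighbors of a target point. The seed, the target point `x0`, the sample sizes, and the helper name `knn_mean` are arbitrary assumptions for the illustration.

```python
import math
import random


def knn_mean(xs, ys, x, k):
    """Uniform nearest neighbor estimate: average of the Y's at the k X's closest to x."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k


rng = random.Random(0)  # fixed seed, chosen arbitrarily
x0 = 1.0                # hypothetical target point
errors = {}
for n in (100, 10_000):
    # Standard Cauchy draws via the inverse d.f. tan(pi (U - 1/2)).
    xs = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
    ys = list(xs)       # Y_i = X_i, so the regression function is m(x) = x
    k = round(math.sqrt(n))
    errors[n] = abs(knn_mean(xs, ys, x0, k) - x0)
    print(n, k, errors[n])
```

The estimate uses only observations near `x0`, so the heavy tails of $Y$ never enter the average; the error is bounded by the distance to the $k$-th nearest neighbor, which shrinks as $n$ grows.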

Continuity of $F_\cdot$ will also be considered in other metrics, such as the KS metric
and the $V_p$ metrics. Some long range conditions are also imposed on $F_\cdot$. Examples are
$\rho(F_x, F_{x_i}) \le B$ for all $x_i$ and $\rho(F_x, F_{x_i}) \le M_x |x - x_i|$, where $\rho$ is a statistical metric. The
latter is a local ($M$ depends on $x$) Lipschitz condition and also imposes a short range
constraint on $F_\cdot$.

Lemma 2.5.1. Suppose the location-scale model (1) holds and $\mu$ and $\sigma$ are continuous
at $x_0$. Then $F_\cdot$ is Prohorov continuous at $x_0$. If $\mathcal{Y} = \mathbb{R}$, $\sigma(x_0) > 0$ and $G$ is continuous
then $F_\cdot$ is KS continuous at $x_0$. If $\mathcal{Y} = \mathbb{R}$ and $G$ has a finite $p$’th moment, $p \ge 1$, then $F_\cdot$
is $V_p$ continuous at $x_0$. If $\mathcal{Y} = \mathbb{R}$ and $G$ is bounded then $F_\cdot$ is $V_\infty$ continuous at $x_0$.

PROOF. Let $Z$ be a random variable with distribution $G$ and characteristic function
$g$. Let $x_n \to x_0$ and denote $\mu(x_j)$ by $\mu_j$, $\sigma(x_j)$ by $\sigma_j$. Then

$$\mathcal{E} e^{it(\mu_n + \sigma_n Z)} = e^{it\mu_n} g(t\sigma_n) \to e^{it\mu_0} g(t\sigma_0) = \mathcal{E} e^{it(\mu_0 + \sigma_0 Z)}$$

by continuity of $g$. This establishes the pointwise convergence of the characteristic
function of $F_{x_n}$ to that of $F_{x_0}$, which implies Prohorov convergence of $F_{x_n}$ to $F_{x_0}$ and hence
Prohorov continuity of $F_\cdot$ at $x_0$.

Suppose $G$ is continuous, $\sigma_0 > 0$ and let $y \in \mathbb{R}$. Then

$$F_{x_n}(y) = G\left( \frac{y - \mu_n}{\sigma_n} \right) \to G\left( \frac{y - \mu_0}{\sigma_0} \right) = F_{x_0}(y)$$

since $G$ is continuous and $1/\sigma(\cdot)$ is continuous at $x_0$. This establishes pointwise
convergence of $F_{x_n}$ to $F_{x_0}$. Monotonicity and boundedness of $F_{x_n}$ and $F_{x_0}$ combine to strengthen
the result to uniform convergence by a lemma of Chung (1974, p. 133) which is restated
in Sec. 3.2. (That lemma also requires convergence of all the jumps, but $G$ has none.)
Therefore $F_\cdot$ is KS continuous at $x_0$.

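Suppose $G$ has a finite $p$’th absolute moment. By the Minkowski inequality

$$V_p(F_{x_n}, F_{x_0}) = \left( \int_0^1 |\mu_n + \sigma_n G^{-1}(u) - \mu_0 - \sigma_0 G^{-1}(u)|^p\, du \right)^{1/p}$$
$$\le \left( \int_0^1 |\mu_n - \mu_0|^p\, du \right)^{1/p} + \left( \int_0^1 |\sigma_n - \sigma_0|^p\, |G^{-1}(u)|^p\, du \right)^{1/p}$$
$$= |\mu_n - \mu_0| + |\sigma_n - \sigma_0| \left( \mathcal{E}|Z|^p \right)^{1/p}. \tag{4}$$

Therefore $F_\cdot$ is $V_p$ continuous at $x_0$.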

If $G$ is bounded

$$\sup_{0<u<1} |\mu_n + \sigma_n G^{-1}(u) - \mu_0 - \sigma_0 G^{-1}(u)| \le |\mu_n - \mu_0| + |\sigma_n - \sigma_0| \operatorname{ess\,sup} Z$$

so $F_\cdot$ is $V_\infty$ continuous at $x_0$. ■

In view of (4) above, a long range condition on $V_p$ is achieved by imposing similar
conditions on $\mu$ and $\sigma$ in the location-scale family. Some authors implicitly control long
range behavior by working in $[0, 1]$ and imposing continuity on the regression. This implies
uniform continuity and also boundedness. Similarly, in the location-scale family a Lipschitz
condition on $\mu$ and $\sigma$ implies one on $V_p$.
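Bound (4) in the proof of Lemma 2.5.1 can be checked numerically in a simple case. The sketch below assumes $G$ is standard normal and $p = 2$, approximates the quantile integral by a midpoint rule, and compares it with the right side of (4). The sequences $\mu_n = 1/n$, $\sigma_n = 1 + 1/n$, the grid size, and the function names are assumptions made for this illustration.

```python
from statistics import NormalDist


def vp_location_scale(mu_n, sigma_n, mu_0, sigma_0, p, grid=20_000):
    """Midpoint-rule approximation to V_p via the quantile formula, G standard normal."""
    g_inv = NormalDist().inv_cdf
    total = 0.0
    for j in range(grid):
        z = g_inv((j + 0.5) / grid)
        total += abs((mu_n - mu_0) + (sigma_n - sigma_0) * z) ** p
    return (total / grid) ** (1.0 / p)


def bound_4(mu_n, sigma_n, mu_0, sigma_0, abs_moment_p, p):
    """Right side of (4): |mu_n - mu_0| + |sigma_n - sigma_0| (E|Z|^p)^(1/p)."""
    return abs(mu_n - mu_0) + abs(sigma_n - sigma_0) * abs_moment_p ** (1.0 / p)


p = 2
checks = []
for n in (1, 10, 100):
    mu_n, sigma_n = 1.0 / n, 1.0 + 1.0 / n     # hypothetical sequences tending to (0, 1)
    lhs = vp_location_scale(mu_n, sigma_n, 0.0, 1.0, p)
    rhs = bound_4(mu_n, sigma_n, 0.0, 1.0, abs_moment_p=1.0, p=p)  # E Z^2 = 1
    checks.append((n, lhs, rhs))
    print(n, lhs, rhs)
```

In this case $V_2$ is close to $\sqrt{2}/n$ while the bound equals $2/n$, so both sides tend to zero at the same rate, which is the sense in which a Lipschitz condition on $\mu$ and $\sigma$ carries over to $V_p$.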

The Lipschitz condition is a fairly weak short range condition. Most results in the
literature assume one or two continuous derivatives of $\mu$ exist. Sharper short range
conditions such as the existence of derivatives of $F_\cdot$ at $x$ will not be considered here.

2.6 Compact Differentiability and von Mises’ Method

This  section  provides  a  brief  outline  of  compact  or  Hadamard  differentiability  and  of  von 
Mises’  method  for  proving  asymptotic  normality  of  statistical  functionals.  It  will  be  used 
in  Chapter  4  to  prove  asymptotic  normality  for  a  class  of  running  statistical  functionals. 
The  material  in  this  section  is  adapted  from  Fernholz  (1983). 

Suppose $T$ is a statistical functional defined on a convex set of distribution functions
that contains all empirical distributions and a distribution $F$, from which a sample will
be obtained. Let $G$ be a member of the convex set. The von Mises derivative $T'_F$ of $T$ at
$F$ is defined by

$$T'_F(G - F) = \frac{d}{dt} T(F + t(G - F)) \Big|_{t=0}$$

so long as there exists a real function $\phi_F(x)$ (not depending on $G$) such that

$$T'_F(G - F) = \int \phi_F(x)\, d(G - F)(x).$$

This defines $\phi_F$ up to an additive constant. The derivative is normalized by taking

$$0 = \int \phi_F(x)\, dF(x).$$

The function $\phi_F(x)$ is better known to statisticians as the influence function:

$$\phi_F(x) = IC(x; F, T) = \frac{d}{dt} T(F + t(\delta_x - F)) \Big|_{t=0}.$$

The quantity $T'_F(G - F)$ is a linear approximation to $T(G) - T(F)$. When $G = F_n$

$$T(F_n) - T(F) \approx T'_F(F_n - F) = \int \phi_F(x)\, dF_n(x) = \frac{1}{n} \sum_{i=1}^n IC(X_i; F, T). \tag{1}$$

Since (1) is an average of $n$ i.i.d. random variables it (times $\sqrt{n}$) will have a normal limit
provided the variance of $IC(X_i; F, T)$ is finite. Von Mises’ method consists of establishing
the normality of the linear term and the convergence to zero in probability of the
remainder:

$$\sqrt{n}\, Rem(F_n - F) = \sqrt{n} \left( T(F_n) - T(F) - T'_F(F_n - F) \right).$$

Strictly, $Rem$ should be $Rem_F$.
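The influence function can be approximated by a finite difference in $t$. The following Python sketch does this for the variance functional on a small discrete $F$ and compares the result with the known closed form $(x - \mu)^2 - \sigma^2$. The discrete distribution, the step size `t`, and the function names are assumptions for the illustration; this is a numerical check, not the method of proof used in the report.

```python
def variance(atoms, masses):
    """Variance functional T(F) for a discrete F given by atoms and masses."""
    mean = sum(a * m for a, m in zip(atoms, masses))
    return sum(m * (a - mean) ** 2 for a, m in zip(atoms, masses))


def influence(T, atoms, masses, x, t=1e-6):
    """Finite-difference approximation to d/dt T(F + t(delta_x - F)) at t = 0."""
    mixed_atoms = list(atoms) + [x]
    mixed_masses = [(1.0 - t) * m for m in masses] + [t]
    return (T(mixed_atoms, mixed_masses) - T(atoms, masses)) / t


atoms = [0.0, 1.0, 2.0, 5.0]        # hypothetical F: four equally likely atoms
masses = [0.25, 0.25, 0.25, 0.25]
mu = sum(a * m for a, m in zip(atoms, masses))
sigma2 = variance(atoms, masses)

ic_values = [influence(variance, atoms, masses, a) for a in atoms]
for a, ic in zip(atoms, ic_values):
    print(a, ic, (a - mu) ** 2 - sigma2)

# Normalization: the influence function integrates to zero under F.
ic_mean = sum(m * ic for m, ic in zip(masses, ic_values))
print(ic_mean)
```

The last printed value is close to zero, matching the normalization $\int \phi_F\, dF = 0$ imposed above.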

Now we define the compact or Hadamard derivative. For von Mises’ method, the set
$V$ below is the space of distributions, and $W$ is usually $\mathbb{R}$.

Definition. Let $V$ and $W$ be topological vector spaces. A function $T$ from $V$ to $W$ is
compactly differentiable at $F$ if there is a continuous linear transformation $T'_F$ from $V$ to $W$
such that for any compact set $K \subset V$

$$\lim_{t \to 0} \frac{T(F + tH) - T(F) - T'_F(tH)}{t} = 0$$

uniformly for $H \in K$. The linear transformation $T'_F$ is the compact derivative of $T$ at $F$.

When the limit is required to hold uniformly on any bounded set the stronger notion
of Frechet differentiability results. When the limit is only required to hold pointwise,
the weaker concept of Gateaux differentiability emerges. The Gateaux derivative is very
similar to von Mises’ derivative. Whenever the compact derivative exists it coincides
with the Gateaux. Frechet differentiability is strong enough that the remainder term
$\sqrt{n}\, Rem(F_n - F) \to 0$ in pr., if $T$ has a Frechet derivative at $F$. Unfortunately, Frechet
differentiability is too strong to be applicable to most statistical functionals. For example
the median is not Frechet differentiable at the uniform distribution on (0,1). The Gateaux
derivative is weak enough that most statistical functionals of interest are differentiable.
Gateaux differentiability is not enough to guarantee that the remainder term converges
to 0. The compact derivative was shown by Reeds (1976) to be strong enough that
its existence forces the remainder term to 0 in probability. It is also weak enough that
it applies to many statistical functionals. For examples see Reeds (1976) and Fernholz
(1983).

In Chapter 4 von Mises’ method is used for conditionally estimated statistical
functionals $T(\hat F_x)$. It is shown there that existence of the compact derivative together with a
Brownian limit for the empirical process $\sqrt{n_x}(\hat F_x - F_x)$ and a further mild condition on the
weights is sufficient for the remainder term $\sqrt{n_x}\, Rem(\hat F_x - F_x)$ to converge in probability
to zero.


3  Consistency 


This chapter considers consistency of $\hat F_\cdot$ for $F_\cdot$ and of $T(\hat F_\cdot)$ for $T(F_\cdot)$. We will consider
pointwise consistency, i.e. the convergence of $\hat F_x$ to $F_x$ for fixed $x \in \mathcal{X}$. Pointwise
consistency of $T(\hat F_x)$ for $T(F_x)$ follows for continuous $T$. Prohorov (weak), Kolmogorov-Smirnov
and Vasserstein consistency of $\hat F_x$ are treated.

3.1  Introduction  and  Definitions 

Consistency of $\hat F_x$ for $F_x$ has two aspects: how the distance between $\hat F_x$ and $F_x$ is
to be measured, and the nature of the convergence to zero of the (random) distance so
measured. Confusion can arise because common ways of expressing the distance
between two distributions have probabilistic interpretations in terms of variables with
those distributions. For example convergence of $\mathcal{L}(Z_n)$ to $\mathcal{L}(Z)$ in the Prohorov metric is
equivalent to weak convergence of $Z_n$ to $Z$. When the distance itself is studied as a random
variable it may be converging weakly, or strongly, or in $L^p$. If the metric converges weakly
then its probability law is converging in the Prohorov metric to that of a point-mass at
zero. For clarity, the metric interpretation will be used for the distance between $\hat F_x$ and
$F_x$ and the usual probabilistic concepts will be used for the distance between the metric
and 0.

Let $U$ be a metric space containing $G$ and the sequence $G_n$, and let $\rho$ be its metric.
Definition. $G_n$ is strongly $U$-consistent for $G$ if $\rho(G_n, G) \to 0$ a.s. as $n \to \infty$.
Definition. $G_n$ is weakly $U$-consistent for $G$ if $\rho(G_n, G) \to 0$ in pr. as $n \to \infty$.
Definition. $G_n$ is $U$-consistent in $L^p$, $p \ge 1$, for $G$ if $\mathcal{E}(\rho(G_n, G)^p) \to 0$ as $n \to \infty$.

Either strong or $L^p$ $U$-consistency implies weak $U$-consistency. Neither strong nor
$L^p$ $U$-consistency implies the other without added conditions such as boundedness of the
metric.

When the set of distributions in $U$ is understood, $U$-consistency may be referred to
as $\rho$-consistency where $\rho$ is the metric. Write $G_n \to G$ $\rho$ in pr., $G_n \to G$ $\rho$ a.s., and
$G_n \to G$ $\rho$ $L^p$ for weak, strong and $L^p$ consistency of the sequence $G_n$ for $G$.

We will obtain strong and weak $\rho$ consistency of $\hat F_x$ for $F_x$ where $x \in \mathcal{X}$ is fixed. Such
consistency is called pointwise consistency.

Two alternatives to pointwise consistency are global consistency and uniform
consistency. Global consistency is the convergence of $\rho(\hat F_X, F_X)$ to zero where $X$ is a random
variable independent of the data and $X, X_1, X_2, \ldots$ are i.i.d. Global $L^p$ consistency was
considered by Stone (1977) for several functionals with $\hat F_x$ obtained by nearest neighbor
methods. In his discussion of Stone’s paper, Bickel (1977) remarks that the pointwise
notions of convergence would seem to be more important from a practical point of view.
Weak or strong pointwise consistency established at almost all $x \in \mathcal{X}$ implies global
consistency of the corresponding type. The implication does not hold for pointwise $L^p$
consistency without some other condition such as a bound for the pointwise $L^p$ errors that
can be integrated with respect to $\mathcal{L}(X)$. Global consistency does not apply to the design
case.

Uniform consistency is said to hold when for any compact $K \subset \mathcal{X}$

$$\sup_{x \in K} \rho(\hat F_x, F_x)$$

converges to 0. Weak or strong uniform consistency is of course stronger than the
corresponding pointwise concept.

Several pointwise consistency results are proved below for $\hat F_x$. Weak and strong
pointwise consistency is inherited by continuous functionals.

Lemma 3.1.1 Let $T$ be a function from the metric space $U$ to the metric space $V$
that is continuous at $F_x \in U$. If $\hat F_x$ is strongly (weakly) $U$-consistent for $F_x$ then
$T(\hat F_x) \to T(F_x)$ a.s. (in pr.).

PROOF. For strong consistency the proof follows by using the continuity of $T$ on the
set of probability 1 for which $\hat F_x$ converges to $F_x$. Let $\epsilon > 0$. For weak consistency the
probability that $T(\hat F_x)$ is within an $\epsilon$-ball of $T(F_x)$ is no less than the probability that $\hat F_x$
is within some $\delta$-ball of $F_x$ by the continuity of $T$, and the latter probability converges to
1 by the consistency of $\hat F_x$. ■

For weak consistency, Lemma 3.1.1 is a special case of the continuous mapping theorem
(see Billingsley (1968) or Pollard (1984)). The general result has convergence in
distribution where the above has convergence in probability to a constant. The general
version of continuity at the limit is continuity with probability 1 at the (random) limit.

$L^p$ consistency of $\hat F_x$ and continuity of T at $F_x$ do not imply $L^p$ consistency of $T(\hat F_x)$.
A further condition, such as Lipschitz continuity of T, is needed.

Lemma 3.1.1 asserts the pointwise consistency of $T(\hat F_x)$ for $T(F_x)$. Its two conditions
are consistency of $\hat F_x$ and continuity of T. Continuity of statistical functionals with respect
to statistical metrics is discussed in Sec. 2.4. The next three sections give sufficient
conditions for the consistency of $\hat F_x$ in the Prohorov, Kolmogorov-Smirnov, and Vasserstein
metrics, in that order. The conditions are expressed in terms of the nature of the continuity
of $F_{\cdot}$, the convergence of the weight measure $W_x$ to $\delta_x$ in an appropriate metric and the
rate at which the effective local sample size $n_x$ becomes infinite.
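
The objects named above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the nearest neighbor weights, the toy data, and the helper names are hypothetical, and the effective local sample size is taken to be $n_x = 1/\sum_i W_i^2$, which is how $n_x$ is used in the proofs of this chapter (its definition is in Chapter 2).

```python
def knn_weights(xs, x0, k):
    """Equal probability weights 1/k on the k sample points nearest to x0."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    w = [0.0] * len(xs)
    for i in order[:k]:
        w[i] = 1.0 / k
    return w

def effective_sample_size(w):
    """n_x = 1 / sum(W_i^2); assumed definition of the effective local sample size."""
    return 1.0 / sum(wi * wi for wi in w)

def weighted_cdf(ys, w, y):
    """Estimated conditional c.d.f.: sum of W_i over observations with Y_i <= y."""
    return sum(wi for yi, wi in zip(ys, w) if yi <= y)

def weighted_quantile(ys, w, q):
    """Smallest observed y whose estimated c.d.f. value is at least q."""
    total = 0.0
    for yi, wi in sorted(zip(ys, w)):
        total += wi
        if total >= q:
            return yi
    return max(ys)

xs = [i / 10 for i in range(11)]   # design points 0.0, 0.1, ..., 1.0
ys = [2 * x for x in xs]           # noiseless responses, for illustration only
w = knn_weights(xs, 0.5, 5)        # weight measure W_x at x = 0.5
print(effective_sample_size(w))
print(weighted_cdf(ys, w, 1.0))
print(weighted_quantile(ys, w, 0.5))
```

With five equal weights the effective local sample size is five, and the conditional median is one of many functionals T that can be read off the weighted distribution.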

3.2  Prohorov  Consistency  of  Fx 

In this section Prohorov continuity of $F_{\cdot}$ and some regularity conditions on the set of
weights are used to establish pointwise weak and strong Prohorov consistency of $\hat F_x$.

The Prohorov metric for finite measures is given in Sec. 2.4. Convergence in this
metric is equivalent to weak convergence of the measures. Weak convergence of finite
signed measures is, except for trivial exceptions, not metrizable. Sec. 2.4 defines a metric
Proh that is stronger than weak convergence, on finite signed measures. In this section
the weights regarded as a finite signed measure on $\mathcal{X}$ are required to converge in the metric
Proh to a point mass at a target point x. This is a shorthand way of saying that the sum
of the negative weights converges to zero and that for any open set in $\mathcal{X}$ the sum of the
positive weights attached to that set converges to 1 or 0 according to whether x is or is
not in the set. At the end of this section, more general conditions are given that imply
weak convergence of $\hat F_x$ to $F_x$, in pr. and a.s. Under these more general conditions, the
weight functions can have a nonzero limiting sum of negative weights.

Throughout this section $\mathcal{X}$ and $\mathcal{Y}$ are complete separable metric spaces. The next
theorem does most of the work for Prohorov consistency of $\hat F_x$.

Theorem 3.2.1 Let $\varphi$ be a bounded measurable function that is continuous on a set of
$F_x$ probability 1. Then under conditions i) and ii) below

$$\int \varphi(y)\,d\hat F_x(y) \to \int \varphi(y)\,dF_x(y) \quad \text{in pr.}$$

and under conditions i) and iii) below

$$\int \varphi(y)\,d\hat F_x(y) \to \int \varphi(y)\,dF_x(y) \quad \text{a.s.}$$

i) $F_{\cdot}$ is Prohorov continuous at x

ii) $W_x \to \delta_x$ Proh in pr. and $n_x \to \infty$ in pr.

iii) $W_x \to \delta_x$ Proh a.s. and $n_x/\log n \to \infty$ a.s.

PROOF. Define

$$\bar\varphi = \int \varphi(y)\,dF_x(y) \quad\text{and}\quad \bar\varphi_i = \int \varphi(y)\,dF_{X_i}(y),$$

let $B = \sup_y |\varphi(y)|$ and $\epsilon > 0$. Then

$$\int \varphi(y)\,d\hat F_x(y) - \int \varphi(y)\,dF_x(y) = \sum_{i=1}^n W_i(\varphi(Y_i) - \bar\varphi) - \bar\varphi\Big(1 - \sum_{i=1}^n W_i\Big). \qquad (1)$$

The second term in (1) converges to 0 weakly under ii) and strongly under iii).

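By the continuous mapping theorem (Billingsley 1968, Sec. 1.5) there is an open set
$A \subset \mathcal{X}$, with $x \in A$, such that $x_i \in A$ implies $|\bar\varphi_i - \bar\varphi| < \epsilon$. The first term from (1) may
now be written

$$\sum_{i=1}^n W_i(\varphi(Y_i) - \bar\varphi) = \sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi) + \sum_{x_i \notin A} W_i(\varphi(Y_i) - \bar\varphi). \qquad (2)$$

The second term in (2) converges to 0 weakly under ii) and strongly under iii) because

$$|\varphi(Y_i) - \bar\varphi| \le 2B.$$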

Let $|W| = \sum_i |W_i|$. Conditionally on the X's the first term in (2) has expectation
bounded in absolute value by $|W|\epsilon$ and variance bounded by $4B^2/n_x$. If $|W| < 2$ and
$n_x > 4B^2/\epsilon^3$ then by Chebychev's inequality the conditional probability that

$$\Big|\sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi)\Big| > 3\epsilon$$

is less than

$$\frac{4B^2/(4B^2/\epsilon^3)}{(3\epsilon - 2\epsilon)^2} = \epsilon.$$

It follows that the unconditional probability

$$P\Big(\Big|\sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi)\Big| > 3\epsilon\Big) \le P(n_x \le 4B^2/\epsilon^3) + P(|W| \ge 2) + \epsilon \to \epsilon$$

by ii). This establishes the first result of the theorem.

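Turning to strong convergence, condition on a sequence of X values satisfying

$$n_x/\log n \to \infty \quad\text{and}\quad \mathrm{Proh}(W_x, \delta_x) \to 0. \qquad (3)$$

Such sequences have probability 1 under iii). Conditionally on the X's the quantities

$$W_i(\varphi(Y_i) - \bar\varphi_i)$$

are independent, have expectation 0 and are bounded in absolute value by $2|W_i|B$. Using
Hoeffding's inequality (see for example Pollard (1984, Appendix B)),

$$\begin{aligned}
P\Big(\Big|\sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi)\Big| > (1 + |W|)\epsilon\Big)
&\le P\Big(\Big|\sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi_i)\Big| + \Big|\sum_{x_i \in A} W_i(\bar\varphi_i - \bar\varphi)\Big| > (1 + |W|)\epsilon\Big)\\
&\le P\Big(\Big|\sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi_i)\Big| > \epsilon\Big)\\
&\le \exp\Big(-2\epsilon^2 \Big/ 4B^2 \sum_i W_i^2\Big)\\
&= \exp(-n_x \epsilon^2 B^{-2}/2)\\
&= n^{-\epsilon^2 B^{-2} n_x/(2\log n)}\\
&< n^{-2}
\end{aligned} \qquad (4)$$

for large enough n by (3).

Because (4) sums we conclude that the conditional probability

$$P\Big(\Big|\sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi)\Big| > (1 + |W|)\epsilon \ \text{i.o.} \,\Big|\, X\Big) = 0 \qquad (5)$$

by the Borel-Cantelli lemma. Since (3) implies $|W| \to 1$ a.s. by iii), we may replace
$(1 + |W|)\epsilon$ by $3\epsilon$ in (5). Since (5) holds for a set of sequences X with probability 1, by
Fubini's theorem

$$P\Big(\Big|\sum_{x_i \in A} W_i(\varphi(Y_i) - \bar\varphi)\Big| > 3\epsilon \ \text{i.o.}\Big) = 0. \quad ■$$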
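
The role of the rate condition $n_x/\log n \to \infty$ in the Hoeffding step of the proof above can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it simply evaluates the bound $\exp(-n_x\epsilon^2/(2B^2))$ and solves for the effective sample size at which that bound drops to $n^{-2}$.

```python
import math

def hoeffding_tail(n_x, eps, B):
    """The bound exp(-n_x * eps^2 / (2 * B^2)) from the Hoeffding step."""
    return math.exp(-n_x * eps ** 2 / (2.0 * B ** 2))

def min_effective_size(n, eps, B):
    """Value of n_x at which the bound equals n^(-2): 4 * B^2 * log(n) / eps^2."""
    return 4.0 * B ** 2 * math.log(n) / eps ** 2

n, eps, B = 1000, 0.1, 1.0
n_x = min_effective_size(n, eps, B)
print(n_x, hoeffding_tail(n_x, eps, B), n ** -2)
```

For any fixed $\epsilon$ the required $n_x$ is a constant multiple of $\log n$, so the bound is eventually below the summable sequence $n^{-2}$ exactly when $n_x$ outgrows every multiple of $\log n$.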

Theorem 3.2.1 holds for any complete separable metric spaces $\mathcal{X}$ and $\mathcal{Y}$. The main
applications are to Euclidean spaces, but also covered are the unit circle and sphere (for
periodic or directional data), the space of continuous functions on a compact interval with
metric induced by the sup norm, and the space of infinite real sequences with metric
induced by the sup norm.

The condition $n_x/\log n \to \infty$ a.s. can be replaced by the slightly sharper, but less
evocative

y%xp(-nge)  — »  0  a.s.  Ve  >  0.

Theorem 3.2.1 is enough to prove consistency for many functionals that can be
analyzed in terms of a finite number of $\varphi(\cdot)$'s. For example, an M estimate of location
generated by a bounded continuous monotone $\psi$ function, with a unique value at $F_x$, must
be consistent because the root based on the positive part of $\hat F_x$ is consistent and the
negative part becomes too small to change the root by much. Note that for signed measures
$\hat F_x$ the M estimate will not necessarily have a unique value, but the smallest and largest
values will be consistent.
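
As a hedged illustration of the M estimate discussed above, the sketch below solves $\sum_i W_i\,\psi(Y_i - t) = 0$ by bisection. The Huber-type clipped $\psi$, the tuning constant, and the function names are assumptions made for the example; the code also assumes nonnegative weights, so that the weighted score is nonincreasing in t.

```python
def huber_psi(r, c=1.345):
    """Bounded, continuous, monotone psi: the residual clipped to [-c, c]."""
    return max(-c, min(c, r))

def m_estimate(ys, w, psi=huber_psi, tol=1e-10):
    """Root in t of sum_i W_i * psi(Y_i - t), located by bisection.

    Assumes nonnegative weights, so the score is nonincreasing in t,
    nonnegative at min(ys) and nonpositive at max(ys)."""
    lo, hi = min(ys), max(ys)

    def score(t):
        return sum(wi * psi(yi - t) for yi, wi in zip(ys, w))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid      # root lies to the right of mid
        else:
            hi = mid      # root lies at or to the left of mid
    return 0.5 * (lo + hi)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
w = [0.2] * 5                      # probability weights
print(m_estimate(ys, w))           # stays near the bulk of the data
print(sum(wi * yi for yi, wi in zip(ys, w)))  # weighted mean, pulled by the outlier
```

Because $\psi$ is bounded, each observation contributes at most $c|W_i|$ to the score, which is the reason a vanishing negative part of the weights cannot move the root far.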

Theorem 3.2.2 Under conditions i) and ii) of Theorem 3.2.1

$$\mathrm{Proh}(\hat F_x, F_x) \to 0 \quad \text{in pr.}$$

and under conditions i) and iii) of Theorem 3.2.1

$$\mathrm{Proh}(\hat F_x, F_x) \to 0 \quad \text{a.s.}$$

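PROOF. Let 0 represent the zero measure. Since $F_x$ is a probability measure

$$\begin{aligned}
\mathrm{Proh}(\hat F_x, F_x) &= \mathrm{Proh}(\hat F_x^+, F_x) + \mathrm{Proh}(\hat F_x^-, 0)\\
&= \mathrm{Proh}(\hat F_x^+, F_x) + \hat F_x^-(\mathcal{Y})\\
&= \mathrm{Proh}(\hat F_x^+, F_x) + W_x^-(\mathcal{X})
\end{aligned}$$

and $W_x^-(\mathcal{X}) \to 0$, so that $\mathrm{Proh}(\hat F_x, F_x) - \mathrm{Proh}(\hat F_x^+, F_x) \to 0$,
weakly under i) and ii) and strongly under i) and iii).

Also

$$\hat F_x^+(\mathcal{Y}) = W_x^+(\mathcal{X}) \to 1$$

weakly under i) and ii) and strongly under i) and iii) so by Lemma 2.4.1 it suffices to
prove convergence for $\pi(\hat F_x^+, F_x)$.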

Let $\epsilon > 0$. Because $\mathcal{Y}$ is a complete separable metric space, and $F_x$ is a probability
measure there are disjoint sets $B_0, \ldots, B_r$ with $F_x(\partial B_j) = 0$, $F_x(B_0) < \epsilon/4$ and for
$j \ge 1$ $B_j$ has diameter less than $\epsilon$, that is $B_j \subset \{y\}^\epsilon$ whenever $y \in B_j$. Note that
the indicators of the sets $B_j$ are $F_x$-a.e. continuous bounded measurable functions so
Theorem 3.2.1 applies to them.

Suppose that $\pi(\hat F_x^+, F_x) > \epsilon$. Then there is a set $A \subset \mathcal{Y}$ such that

$$\hat F_x^+(A) > F_x(A^\epsilon) + \epsilon \qquad (6)$$

where $A^\epsilon$ is defined by equation 2.4.2. Let $A_j = A \cap B_j$ be a partition of A. The inequality
(6) only happens when either

$$\hat F_x^+(A_0) - F_x(A_0^\epsilon) > \epsilon/2 \qquad (7)$$

or for some $j \ge 1$

$$\hat F_x^+(A_j) - F_x(A_j^\epsilon) > \epsilon/2r. \qquad (8)$$

But $\hat F_x^+(A_0) \le \hat F_x^+(B_0) \to F_x(B_0) < \epsilon/4$ with weak convergence under i) and ii) and
strong convergence under i) and iii). Therefore the probability of the event (7) converges
to 0 under i) and ii) and the probability that (7) happens infinitely often is 0 under i) and
iii). As for (8)

$$\begin{aligned}
\hat F_x^+(A_j) - F_x(A_j^\epsilon) &\le \hat F_x^+(B_j) - F_x(A_j^\epsilon)\\
&\le \hat F_x^+(B_j) - F_x(B_j)
\end{aligned}$$

so (8) can occur only if

$$\hat F_x^+(B_j) - F_x(B_j) > \epsilon/2r$$

and as before this event has probability tending to 0 under i) and ii), and zero probability
of infinite occurrence under i) and iii). ■

For $\mathcal{Y} = \mathbb{R}$ the strong result above can be obtained from strong convergence of the
$\hat F_x$ probabilities for an appropriate countable set of intervals to the corresponding $F_x$
probabilities.

Corollary If T is a statistical functional that is robust at $F_x$ and the $W_i$ are probability
weights, then under i) and ii) of Theorem 3.2.1

$$T(\hat F_x) \to T(F_x) \quad \text{in pr.}$$

and under i) and iii) of Theorem 3.2.1

$$T(\hat F_x) \to T(F_x) \quad \text{a.s.}$$

PROOF. Because T is robust at $F_x$, it is continuous at $F_x$ on the space of probability
distributions on $\mathcal{Y}$ under the Prohorov metric, by Hampel's theorem. The result then
follows from Theorem 3.2.2 and Lemma 3.1.1. ■

To  obtain  consistency  of  running  robust  functionals  when  negative  weights  are  used 
it  suffices  to  show  that  the  functionals  are  still  continuous  when  extended  to  finite  signed 
measures. 

The mean is not a Prohorov continuous functional, so the Prohorov consistency
theorem does not yield a consistency proof for the regression. The mean is Prohorov
continuous on the space of distribution functions that satisfy $\int |Y|^{1+\delta}\,dF(Y) \le B$ for some
$\delta > 0$, $B < \infty$. This follows for example from Theorem 4.5.2 of Chung (1974). Assuming
that $\sup_x \int |Y|^{1+\delta}\,dF_x(Y) \le B$ is not quite enough, since a bound has to hold on the
sequence $\hat F_x$.

Under the assumption that $|Y| \le B < \infty$, consistency of the regression function is
now easy to obtain.

Theorem 3.2.3 Let $m(F) = \int y\,dF(y)$ and assume $|Y| \le B < \infty$. If conditions i) and ii)
of Theorem 3.2.1 hold then

$$m(\hat F_x) \to m(F_x) \quad \text{in pr.}$$

and under conditions i) and iii)

$$m(\hat F_x) \to m(F_x) \quad \text{a.s.}$$

PROOF. Use $\varphi(y) = y$. Because $|Y| \le B$, Theorem 3.2.1 applies. ■

Devroye (1981) obtains strong pointwise consistency for the regression function
assuming bounded Y. His conditions on the weights are slightly stronger than those above
(he uses probability weights and imposes a stronger condition on the largest of them), but
he does not place the Prohorov continuity condition on the conditional distribution of
Y. He obtains weak pointwise convergence without using bounded Y, for nearest neighbor
weights (that are exactly 0 for all but a vanishingly small fraction of the observations) and
for a restricted class of kernel estimates. Devroye (1982) extends the regression consistency
results and obtains some sufficient conditions under the bounded Y assumption.

The theorems above need slight modification to apply to weight schemes, including
many kernel methods, that have asymptotically non-negligible negative weights. For
$\mathrm{Proh}(W_x, \delta_x)$ to vanish, the sum of the negative weights has to go to 0. More typically
there is a constant $b \in (0, \infty)$ such that

$$\mathrm{Proh}(W_x^-, b\,\delta_x) \to 0 \qquad (9a)$$

$$\mathrm{Proh}(W_x^+, (1 + b)\,\delta_x) \to 0. \qquad (9b)$$

Then $\mathrm{Proh}(W_x, \delta_x) \to 2b > 0$. The conclusions of Theorem 3.2.1 still hold when in pr.
and a.s. versions of (9ab) are used. Theorem 3.2.2 won't hold because $\hat F_x^-(\mathcal{Y}) \to b$. The
essence of Theorem 3.2.2 is that $\hat F_x \to F_x$ in the sense of weak convergence, and that result
can be generalized.
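
A kernel with negative lobes shows how a nonzero limit b arises. The sketch below is an illustration under assumptions: the fourth-order polynomial kernel, the equispaced design on [0, 1], the bandwidth, and all function names are chosen for the example and are not taken from this report. It compares the total negative weight at a target point with a Riemann-sum approximation of the integral of the negative part of the kernel.

```python
def k4(u):
    """Fourth-order polynomial kernel on [-1, 1]; it integrates to 1 and is
    negative for sqrt(3/7) < |u| < 1."""
    return 15.0 / 32.0 * (3.0 - 10.0 * u * u + 7.0 * u ** 4) if abs(u) <= 1 else 0.0

def kernel_weights(xs, x0, h, kernel=k4):
    """W_i = K((x_i - x0)/h) / sum_j K((x_j - x0)/h), so the weights sum to 1."""
    raw = [kernel((xi - x0) / h) for xi in xs]
    total = sum(raw)
    return [r / total for r in raw]

n = 2001
xs = [i / (n - 1) for i in range(n)]        # equispaced design on [0, 1]
w = kernel_weights(xs, 0.5, 0.2)
neg_mass = -sum(wi for wi in w if wi < 0)   # total mass of the negative weights
pos_mass = sum(wi for wi in w if wi > 0)    # total mass of the positive weights

# Midpoint Riemann sum for b = integral of the negative part of K over [-1, 1].
m = 20000
b = sum(max(-k4(-1 + 2 * (j + 0.5) / m), 0.0) for j in range(m)) * 2 / m

print(neg_mass, pos_mass, b)
```

On a dense design the negative mass settles near b and the positive mass near 1 + b, which is the situation described by (9a) and (9b).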

Let $\mathcal{O}$ be the set of open sets in the topology of weak convergence.

Definition $\hat F_x \to F_x$ weakly in pr. if $F_x \in O \in \mathcal{O}$ implies

$$\lim_{n \to \infty} P(\hat F_x \in O) = 1.$$

Definition $\hat F_x \to F_x$ weakly a.s. if $F_x \in O \in \mathcal{O}$ implies

$$P\Big(\lim_{n \to \infty} \hat F_x \in O\Big) = 1.$$

The following theorem employs a sequence of nonnegative random variables

$$b_n = b_n(X_1, \ldots, X_n).$$

Useful possibilities are $b_n = |W_x^-|$ and $b_n = b = \int K^-(v)\,dv$ for a kernel function K.

Theorem 3.2.4 Let $\varphi$ be a bounded measurable function that is continuous on a set of
$F_x$ probability 1. Then under conditions i) and ii) below

$$\int \varphi(y)\,d\hat F_x(y) \to \int \varphi(y)\,dF_x(y) \ \text{in pr.} \quad\text{and}\quad \hat F_x \to F_x \ \text{weakly in pr.}$$

Under conditions i) and iii) below

$$\int \varphi(y)\,d\hat F_x(y) \to \int \varphi(y)\,dF_x(y) \ \text{a.s.} \quad\text{and}\quad \hat F_x \to F_x \ \text{weakly a.s.}$$

i) $F_{\cdot}$ is Prohorov continuous at x

ii) There exist nonnegative r.v.s $b_n(X_1, \ldots, X_n)$ such that:

$\mathrm{Proh}(W_x^+, (1 + b_n)\delta_x) \to 0$ in pr.

$\mathrm{Proh}(W_x^-, b_n\delta_x) \to 0$ in pr.

$\forall \epsilon > 0 \ \exists B_\epsilon < \infty$ with $\limsup P(b_n > B_\epsilon) < \epsilon$

$n_x \to \infty$ in pr.

iii) There exist nonnegative r.v.s $b_n(X_1, \ldots, X_n)$ such that:

$\mathrm{Proh}(W_x^+, (1 + b_n)\delta_x) \to 0$ a.s.

$\mathrm{Proh}(W_x^-, b_n\delta_x) \to 0$ a.s.

$\exists B < \infty$ with $P(\limsup b_n > B) = 0$

$n_x \to \infty$ a.s.

PROOF. For $\varphi$ bounded, measurable and continuous a.e. $[F_x]$ write

$$\Big|\int \varphi(y)\,d\hat F_x(y) - \int \varphi(y)\,dF_x(y)\Big| \le \Big|\int \varphi(y)\,d\hat F_x^+(y) - (1 + b_n)\int \varphi(y)\,dF_x(y)\Big| + \Big|\int \varphi(y)\,d\hat F_x^-(y) - b_n \int \varphi(y)\,dF_x(y)\Big|. \qquad (10)$$

The proof of Theorem 3.2.1 can be adapted to show that both terms in (10) converge to
0, in pr. under i) and ii) and a.s. under i) and iii). The bounding conditions on $b_n$ are
used in the Chebychev and Hoeffding arguments applied to the first term in (2).

Let $O \in \mathcal{O}$ contain $F_x$. Then there is a basic open set M such that $F_x \in M \subset O$. The
neighborhood base at $F_x$ in the topology of weak convergence consists of sets of the form

$$\bigcap_{j=1}^{k} \Big\{G : \Big|\int \varphi_j(y)\,dG(y) - \int \varphi_j(y)\,dF_x(y)\Big| < \epsilon\Big\}$$

for nonnegative integers k, positive $\epsilon$ and bounded continuous functions $\varphi_j$. It follows
from the convergence of (10) to 0 for any finite set of bounded continuous functions $\varphi_j$
that $\hat F_x \to F_x$ weakly in pr. under i) and ii) and weakly a.s. under i) and iii). ■

Perhaps the conditions above can be further weakened to weak convergence of $W_x$
to $\delta_x$, in pr. and a.s. Such generality is not needed in most smoothing applications.
Consider the following example: Let $r_i$ be an enumeration of a countable dense subset of
$\mathcal{X}$. Let

>n+ =  «.+!>■%  “d  *7  =  !>■%. 

*  i 

where $d(r_i, t_{in}) < 1/n$. Then $W_x \to \delta_x$ weakly, but does not satisfy the conditions of
Theorem 3.2.4. For applications, it is reasonable to assume that $|W_x| \to 0$ on the complement
of any open set containing x. This was used to handle the second term in (2).

3.3  KS  Consistency  of  Fx 

This section provides sufficient conditions for $KS(\hat F_x, F_x)$ to vanish. The result is similar
to that for the Prohorov metric except that the Prohorov continuity condition on $F_{\cdot}$ is
strengthened to KS continuity. Fortunately it is not necessary to strengthen the Prohorov
convergence of $W_x$ to KS convergence, since the latter only holds together with $n_x \to \infty$
when the number of $x_i$ equal to x grows without bound. The Kolmogorov-Smirnov metric
is stronger than the Prohorov metric so that convergence in the former implies convergence
in the latter.

Any functional T that is continuous when the Prohorov metric is used on its domain
is also continuous when the KS metric is used. Functionals such as $J_y(F) = F(y) - F(y-)$
are continuous when the KS metric is used on the distributions, but may not be when the
Prohorov metric is used. Therefore the KS consistency results of this section are useful in
situations where there are atoms in the distribution of Y.
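
The KS distance between a weighted empirical distribution and a reference distribution with an atom can be computed exactly by checking both sides of every jump. The sketch below is an illustration only: the function names and the toy distribution (mass 1/2 at 0, the rest uniform on (0, 1)) are hypothetical, and the caller is assumed to supply both the reference c.d.f. and its left limit.

```python
def ks_distance(ys, w, cdf, cdf_left):
    """sup over y of |F_hat(y) - F(y)|, where F_hat(y) is the sum of W_i with
    Y_i <= y, F is `cdf`, and `cdf_left(y)` returns the left limit F(y-)."""
    pts = sorted(zip(ys, w))
    d, level, i = 0.0, 0.0, 0
    while i < len(pts):
        y = pts[i][0]
        # Just to the left of y the step function still equals `level`.
        d = max(d, abs(level - cdf_left(y)))
        while i < len(pts) and pts[i][0] == y:   # merge tied observations
            level += pts[i][1]
            i += 1
        d = max(d, abs(level - cdf(y)))
    # To the right of the largest observation F tends to 1.
    return max(d, abs(level - 1.0))

def cdf(y):
    """Mass 1/2 at 0 plus a uniform(0, 1) component carrying mass 1/2."""
    return 0.0 if y < 0 else min(1.0, 0.5 + 0.5 * y)

def cdf_left(y):
    """Left limit of cdf; it differs from cdf only at the atom y = 0."""
    return 0.0 if y <= 0 else min(1.0, 0.5 + 0.5 * y)

ys = [0.0, 0.0, 0.5, 1.0]
w = [0.25] * 4
print(ks_distance(ys, w, cdf, cdf_left))
```

Between consecutive jump points the step function is constant and the reference c.d.f. is monotone, so the supremum is approached at the ends of each such interval; that is why checking the value and the left limit at each observation suffices.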

First  we  note  for  later  use: 

Lemma 3.3.1 Let F be a distribution function on $\mathbb{R}$. Let J be the set of points of jump
of F and let Q be the set of rational numbers. If

$$F_n(y) \to F(y) \quad \forall y \in Q$$

and

$$F_n(y) - F_n(y-) \to F(y) - F(y-) \quad \forall y \in J$$

then $KS(F_n, F) \to 0$.

PROOF. This is proved in Chung (1974, p. 133). ■

Lemma 3.3.2 Let $y_0 \in \mathcal{Y} = \mathbb{R}$. Under conditions i) and ii) below

$$\hat F_x(y_0) \to F_x(y_0) \quad \text{in pr.}$$

and under conditions i) and iii) below

$$\hat F_x(y_0) \to F_x(y_0) \quad \text{a.s.}$$

i) $F_{\cdot}$ is KS continuous at x

ii) $W_x \to \delta_x$ Proh in pr. and $n_x \to \infty$ in pr.

iii) $W_x \to \delta_x$ Proh a.s. and $n_x/\log n \to \infty$ a.s.

Definition A sequence will be said to converge appropriately if it converges weakly under
conditions i) and ii) and strongly under conditions i) and iii).

PROOF. Let $\varphi(Y_i) = 1_{Y_i \le y_0}$. If $y_0$ is a continuity point of $F_x$ then $\varphi$ satisfies the
conditions of Theorem 3.2.1 and so $\hat F_x(y_0)$ converges appropriately to $F_x(y_0)$. If $y_0$ is not a
continuity point of $F_x$ then $\varphi(\cdot)$, though bounded and measurable, fails to be continuous
a.e. $[F_x]$.

In the proof of Theorem 3.2.1 the a.s. continuity of $\varphi(\cdot)$ was only used to establish the
existence of an open set $A \ni x$ such that $x_i \in A$ implies $|\int \varphi(y)\,dF_{x_i}(y) - \int \varphi(y)\,dF_x(y)| < \epsilon$.
But KS continuity of $F_{\cdot}$ at x guarantees the existence of such a set and so with
this modification we can establish the appropriate convergence of $\hat F_x(y_0)$ to $F_x(y_0)$ as in
Theorem 3.2.1. ■

Lemma 3.3.3 Let $y_0 \in \mathcal{Y} = \mathbb{R}$. Under conditions i) and ii) of Lemma 3.3.2

$$\hat F_x(y_0) - \hat F_x(y_0-) \to F_x(y_0) - F_x(y_0-) \quad \text{in pr.}$$

and under conditions i) and iii) of Lemma 3.3.2

$$\hat F_x(y_0) - \hat F_x(y_0-) \to F_x(y_0) - F_x(y_0-) \quad \text{a.s.}$$

PROOF. Proceed as in the proof of Lemma 3.3.2. Let $\varphi(Y_i) = 1_{Y_i = y_0}$. If $y_0$ is not an
atom of $F_x$ then the proof follows from Theorem 3.2.1. If $y_0$ is an atom, KS continuity of
$F_{\cdot}$ implies that the open set A required in Theorem 3.2.1 for $\varphi(\cdot)$ exists. In either case
$\hat F_x(y_0) - \hat F_x(y_0-)$ converges appropriately to $F_x(y_0) - F_x(y_0-)$. ■

Theorem 3.3.1 Let $\mathcal{Y} = \mathbb{R}$. Under conditions i) and ii) of Lemma 3.3.2

$$\hat F_x \to F_x \quad \text{KS in pr.}$$

and under conditions i) and iii) of Lemma 3.3.2

$$\hat F_x \to F_x \quad \text{KS a.s.}$$


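PROOF. Define

$$\tilde F_x(y) = \frac{\hat F_x^+(y)}{\sup_y \hat F_x^+(y)} = \frac{\hat F_x^+(y)}{W_x^+(\mathcal{X})}.$$

Now

$$\begin{aligned}
KS(\hat F_x, F_x) &\le KS(\hat F_x^+, F_x) + KS(\hat F_x^-, 0)\\
&\le KS(\tilde F_x, F_x) + KS(\hat F_x^+, \tilde F_x) + KS(\hat F_x^-, 0).
\end{aligned} \qquad (1)$$

The third term in (1) is bounded by $|W_x^-(\mathcal{X})|$ and the second term is bounded by
$|W_x^+(\mathcal{X}) - 1|$, both of which converge appropriately to 0.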

Write

$$\tilde F_x(y) = \sum_i \tilde W_i\, 1_{Y_i \le y}$$

where $\tilde W_i = W_i^+ / \sum_j W_j^+$. The weights $\tilde W_i$ satisfy condition ii) when the $W_i$ do and
similarly for condition iii).

For strong convergence, apply Lemma 3.3.2 with weights $\tilde W_i$ at points in the set Q of
rational numbers and Lemma 3.3.3 to all the points in J, the set of points of jump of $F_x$.
(The set $Q \cup J$ is countable.) Then except on the union of a countable number of null sets
(which is again a null set)

$$\tilde F_x(y) \to F_x(y) \quad \forall y \in Q$$

and

$$\tilde F_x(y) - \tilde F_x(y-) \to F_x(y) - F_x(y-) \quad \forall y \in J.$$

Therefore with probability 1

$$KS(\tilde F_x, F_x) \to 0$$

by Lemma 3.3.1.

For weak convergence, let $\epsilon > 0$. Select a finite grid

$$-\infty = y_0 < y_1 < \cdots < y_{r-1} < y_r = \infty$$

such that the $F_x$ probability of each open interval $(y_j, y_{j+1})$, $0 \le j < r$, is less than $\epsilon$.
(The grid contains any atoms of $F_x$ that are greater than $\epsilon$.) By Lemmas 3.3.2 and 3.3.3
$\tilde F_x(y_j) \to F_x(y_j)$ in pr. and $\tilde F_x(y_j) - \tilde F_x(y_j-) \to F_x(y_j) - F_x(y_j-)$ in pr. at each $y_j$.
Therefore $KS(\tilde F_x, F_x) \to 0$ in pr. by a standard multi-$\epsilon$ argument that uses the monotonicity of
$\tilde F_x$ and $F_x$. ■

Although Theorem 3.3.1 assumes the KS continuity of $F_{\cdot}$ at x, it was only used
to get the continuity of the running probabilities $F_{\cdot}(y_0)$ and jumps $F_{\cdot}(y_0) - F_{\cdot}(y_0-)$ at
x. Nowhere was the uniform continuity of these quantities that KS continuity imposes
explicitly used. Yet it follows from Lemma 3.3.1 that the continuity of the jumps and
probabilities at x implies the KS continuity of $F_{\cdot}$ at x.

The results of this section were obtained under the assumption that $\mathrm{Proh}(W_x, \delta_x) \to 0$.
This can be weakened to accommodate weights that have asymptotically nonnegligible
negative components. The conditions of Theorem 3.2.4 are adequate. The proof of Theorem
3.3.1 must be modified slightly: consider $KS(\hat F_x^+, (1 + b_n)F_x)$ and $KS(\hat F_x^-, b_n F_x)$.

It should be possible to extend the results of this section to Glivenko-Cantelli classes
of sets in $\mathbb{R}^d$.

3.4  Vasserstein  Consistency  of  Fx 

Recall that $\hat F_x \to F_x$ in the Vasserstein metric $V_p$ iff $\hat F_x \to F_x$ in the Prohorov metric and
$\int |Y|^p\,d\hat F_x \to \int |Y|^p\,dF_x$. The main reason to consider these metrics is to study the
corresponding moments, particularly the regression function. The main advantage of using
the Vasserstein metric instead of a direct method is that with the bias-variance
decomposition based on the $Y_i^x$, the bias term is conveniently handled. The triangle inequality for
metrics can be used to split the problem into bias and variance parts and some conditions
on the weights can be expressed naturally in terms of Vasserstein distances.
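
For distributions on the real line the distance can be computed by matching quantiles. The sketch below is an illustration under assumptions: it takes $V_p$ to be the usual $L^p$ transport distance $(\int_0^1 |Q_1(u) - Q_2(u)|^p\,du)^{1/p}$ between quantile functions (the definition used in this report is in Sec. 2.4), it requires nonnegative weights summing to 1 in each sample, and the function name is hypothetical.

```python
def vasserstein_1d(ys1, w1, ys2, w2, p=1.0):
    """L^p quantile-coupling distance between two discrete probability
    distributions on the real line, given as atoms ys with weights w."""
    a = sorted(zip(ys1, w1))
    b = sorted(zip(ys2, w2))
    i = j = 0
    ra, rb = a[0][1], b[0][1]   # mass not yet matched at the current atoms
    total = 0.0
    while i < len(a) and j < len(b):
        m = min(ra, rb)         # mass moved between the two current atoms
        total += m * abs(a[i][0] - b[j][0]) ** p
        ra -= m
        rb -= m
        if ra <= 1e-15:         # current atom of the first sample is used up
            i += 1
            if i < len(a):
                ra = a[i][1]
        if rb <= 1e-15:         # current atom of the second sample is used up
            j += 1
            if j < len(b):
                rb = b[j][1]
    return total ** (1.0 / p)

# A shift by 2 moves every unit of mass a distance 2, for any p.
print(vasserstein_1d([0.0, 1.0], [0.5, 0.5], [2.0, 3.0], [0.5, 0.5], p=2.0))
# A point mass at 0 against masses 1/2 at -1 and +1.
print(vasserstein_1d([0.0], [1.0], [-1.0, 1.0], [0.5, 0.5], p=1.0))
```

The sorted matching pairs the smallest remaining mass of one sample with the smallest remaining mass of the other, which is the quantile coupling; it does not apply directly to signed weights.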

The results for strong convergence are not as sharp as those obtainable by direct
arguments based on the regression function. The sharpest available results appear to be
those of Zhao and Fang (1985) and Zhao and Bai (1984). The paper by Zhao and Fang
considers what are essentially uniform kernels and obtains strong global consistency. The
paper by Zhao and Bai considers a very general family of nearest neighbor methods and
obtains strong pointwise consistency. They exploit an asymptotic equitability constraint
that in the language of Chapter 2 is

$$\sup_n \max_i n_x W_i < \infty$$

for their probability weights. If $n_x$ is the effective sample size, then no observation should
get too great a multiple of the "fair share" $1/n_x$. (Of course most observations get
an infinitesimal fraction of $1/n_x$.) Both papers consider the sampling case and assume
$\mathcal{E}|Y|^p < \infty$ for some $p > 1$. Most other published works use at least a finite second
moment for Y. The indirect results given here use "off the shelf" laws of large numbers
for triangular arrays. For strong convergence one can do better by exploiting relationships
between the rows of the arrays.

It follows from straightforward analysis that $V_p(\hat F_x, F_x) \to 0$ in pr. iff
$\mathrm{Proh}(\hat F_x, F_x) \to 0$ in pr. and $\int |Y|^p\,d\hat F_x - \int |Y|^p\,dF_x \to 0$ in pr.

By considering the fixed points in the sample space, $V_p(\hat F_x, F_x) \to 0$ a.s. iff
$\mathrm{Proh}(\hat F_x, F_x) \to 0$ and $\int |Y|^p\,d\hat F_x - \int |Y|^p\,dF_x \to 0$ a.s.

For  Vasserstein  consistency,  the  conditions  for  Prohorov  consistency  are  strengthened. 
Assuming  Prohorov  consistency,  the  weak  or  strong  consistency  of  Vp  is  equivalent  to  the 
weak  or  strong  consistency  of  the  p’th  absolute  moment. 

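The bias-variance split is

$$V_p(\hat F_x, F_x) \le V_p(\hat F_x, \hat F_x^x) + V_p(\hat F_x^x, F_x)$$

where

$$\hat F_x^x = \sum_{i=1}^n W_i\,\delta_{Y_i^x}. \qquad (1)$$

The bias term will be handled by direct consideration of $V_p(\hat F_x, \hat F_x^x)$. For the variance term
it is easier to work with $\int |Y|^p\,d\hat F_x^x - \int |Y|^p\,dF_x$.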

We consider first the variance term. For strong convergence of the variance term,
we need strong convergence for certain row sums of random variables in a triangular
array. This is more difficult to obtain than strong convergence of sample means and more
moments are assumed. The source for most of the results on strong convergence including
parts (i) through (iv) of the next lemma is Stout (1969). That reference has very sharp
laws of large numbers for triangular arrays, under conditions that are much more general
than required here.

Lemma 3.4.1 Let $W_{nk}$ be fixed real numbers for $1 \le k \le n < \infty$ and let $D_k$ be i.i.d.
random variables with $\mathcal{E}(D_k) = 0$. Set $n_x^{-1} = \sum_{k=1}^n W_{nk}^2$ and $T_n = \sum_{k=1}^n W_{nk} D_k$. Then
$T_n \to 0$ a.s. if any of the following sets of conditions holds:

(i)  \Wnk\  <  Bn~a,  and  n*/logn— *■  oo,  and  <  oo 

(ii)  \Wnk\  <  Bn~a,  and  n2/logn-*oo,  and  £\Dk\2+x^a  <  oo 

(iii)  \Wnk\<Bn~*,  and  nt  >  and  £\Dk\2+x  <  oo 

(iv)  |W„*|  <  BA:-1/2,  and  nx  >  Bn“,  and  £\Dk\2  <  oo, 
where $B > 0$, $t > 0$, $\lambda > 0$ and $\alpha \in (0, 1)$ are constants.

PROOF. Note that $n_x/\log n \to \infty$ implies $\sum_n \exp(-t n_x) < \infty$, $\forall t > 0$. Stout (1969) uses
the latter condition.

Part (i) follows from Stout's Corollary 1, which is derived from his Theorem 1(i) with
$\beta = 1 - \alpha$. Part (ii) follows from Stout's Theorem 1(i) with $\beta = \alpha$. Part (iii) follows from
Stout's Theorem 1(i) with $\beta = \alpha(1 + \lambda) - 1$. Part (iv) follows from Stout's Theorem 2. ■

Conditions (i) and (ii) place the mildest restrictions on the growth of $n_x$. For $\alpha > 1/2$
(i) is preferred to (ii) and the reverse holds for $\alpha < 1/2$. When stronger conditions are
placed on the growth of $n_x$ a better tradeoff between the bound on $|W_i|$ and the number of
moments required of $D_k$ can be obtained via (iii). Part (iv) is unusual in that the bound
is not on the maximum weight in a row, but on the maximum weight ever placed on a
given $D_k$. It allows a milder moment condition.

Definition A sequence $Z_i$ converges completely to 0 if $\forall \epsilon > 0$

$$\sum_{i=1}^{\infty} P(|Z_i| > \epsilon) < \infty.$$

Complete convergence implies a.s. convergence by the Borel-Cantelli lemma and is in
fact strictly stronger than a.s. convergence. Under conditions (i) through (iii) complete
convergence to 0 is obtained.

If  the  moment  conditions  in  Lemma  3.4.1  are  suitably  strengthened,  the  Dk  do  not 
need  to  be  identically  distributed. 

Lemma 3.4.1' Let $W_{nk}$ be fixed real numbers for $1 \le k \le n < \infty$ and let $D_k$ be independent
random variables with $\mathcal{E}(D_k) = 0$. Set $n_x^{-1} = \sum_{k=1}^n W_{nk}^2$ and $T_n = \sum_{k=1}^n W_{nk} D_k$.

Then $T_n \to 0$ a.s. if any of the following sets of conditions holds:

(i)  \Wnk\  <  Bn~a,  and  n*/logn  -*•  oo,  and  £  (  |Z>t|2/a(log+  |Z>7b|)1"Hr7  )  <  S 

(ii)  \Wnk\  <  Bn~a,  and  nx/ logn  -»•  oo,  and  £  (  |£>jk|J+1/a(log+  |L»i|)1+’7  )  <  B 

(iii)  |W„*|  <  Bn~a,  and  nx  >  Bn1_“A,  and  £  (  |£»fc|2+A(log+  \Dk\)1+TI  )  <  B 

where $B > 0$, $\lambda > 0$, $\eta > 0$, and $\alpha \in (0, 1)$ are constants.

PROOF. Items i-iii follow from Stout's Theorem 4 in the same way that the
corresponding parts of Lemma 3.4.1 do from Stout's Theorem 3. ■

Stout does not provide a version of his Theorem 2 for the non identically distributed
case, so there is no Lemma 3.4.1'(iv).

The  following  technical  lemma  from  Chung  (1974)  is  used  for  weak  convergence  of  the 
variance  term. 

Lemma 3.4.2 Let $\{\theta_{nj}, 1 \le j \le k_n\}$ be a double array of complex numbers such that as
$n \to \infty$:

$$\max_{1 \le j \le k_n} |\theta_{nj}| \to 0 \qquad (2a)$$

$$\sum_{j=1}^{k_n} |\theta_{nj}| \le M < \infty \qquad (2b)$$

$$\sum_{j=1}^{k_n} \theta_{nj} \to \theta \qquad (2c)$$

where $\theta$ is a finite complex number. Then

$$\prod_{j=1}^{k_n} (1 + \theta_{nj}) \to e^{\theta}.$$

PROOF. Chung (1974, p. 199) ■
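
A quick numerical check of Lemma 3.4.2 uses the simplest array that satisfies (2a) through (2c), namely $\theta_{nj} = \theta/n$ with $k_n = n$. The sketch is only an illustration; the particular complex value of $\theta$ and the function name are arbitrary choices for the example.

```python
import cmath

def product_limit(theta, n):
    """Product over j = 1..n of (1 + theta/n), the array theta_nj = theta/n."""
    prod = 1 + 0j
    for _ in range(n):
        prod *= 1 + theta / n
    return prod

theta = 0.5 + 2.0j
for n in (10, 100, 10000):
    # The gap to exp(theta) shrinks as n grows.
    print(n, abs(product_limit(theta, n) - cmath.exp(theta)))
```

Here the largest term is $|\theta|/n$, the absolute sum is $|\theta|$ for every n, and the plain sum is $\theta$, so all three conditions of the lemma hold.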

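Corollary Let $\{\theta_{nj}, 1 \le j \le k_n\}$ be a double array of complex random variables such
that as $n \to \infty$:

$$\max_{1 \le j \le k_n} |\theta_{nj}| \to 0 \quad \text{in pr.} \qquad (3a)$$

$$P\Big(\sum_{j=1}^{k_n} |\theta_{nj}| \le M\Big) \to 1 \qquad (3b)$$

$$\sum_{j=1}^{k_n} \theta_{nj} \to \theta \quad \text{in pr.} \qquad (3c)$$

where $\theta$ is a finite complex number and $M < \infty$ is a constant. Then

$$\prod_{j=1}^{k_n} (1 + \theta_{nj}) \to e^{\theta} \quad \text{in pr.}$$

PROOF. Let $\epsilon > 0$. By Lemma 3.4.2 there exists $\delta > 0$ such that $\max_j |\theta_{nj}| < \delta$
and $\sum_j |\theta_{nj}| \le M$ and $|\sum_j \theta_{nj} - \theta| < \delta$ together imply

$$\Big|\prod_{j=1}^{k_n} (1 + \theta_{nj}) - e^{\theta}\Big| < \epsilon.$$

Therefore

$$P\Big(\Big|\prod_{j=1}^{k_n} (1 + \theta_{nj}) - e^{\theta}\Big| \ge \epsilon\Big) \le P\Big(\max_j |\theta_{nj}| \ge \delta\Big) + P\Big(\sum_j |\theta_{nj}| > M\Big) + P\Big(\Big|\sum_j \theta_{nj} - \theta\Big| \ge \delta\Big) \to 0. \quad ■$$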


The next two lemmas establish weak and strong convergence of the variance term in
(1), assuming the weak and strong (respectively) convergence of $\mathrm{Proh}(\hat F_x, F_x)$.

Lemma 3.4.3 For $p \ge 1$ suppose that

$$\mu_p(x) = \int |y|^p\,dF_x(y) < \infty$$

and let $W_i$ be weights satisfying

$$n_x \to \infty \quad \text{in pr.} \qquad (4a)$$

$$\sum_{i=1}^n W_i \to 1 \quad \text{in pr.} \qquad (4b)$$

$$P\Big(\sum_{i=1}^n |W_i| \le B\Big) \to 1 \qquad (4c)$$

for some fixed $B < \infty$. Then

$$\sum_{i=1}^n W_i |Y_i^x|^p \to \mu_p(x) \quad \text{in pr.}$$

PROOF. Let

$$Z_i = |Y_i^x|^p - \mu_p(x).$$

Then

$$\sum_{i=1}^n W_i |Y_i^x|^p - \mu_p(x) = \sum_{i=1}^n W_i Z_i - \mu_p(x)\Big(1 - \sum_{i=1}^n W_i\Big). \qquad (5)$$

The second term in (5) converges to 0 in pr. by (4b).

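Let g be the characteristic function of the $Z_i$. Then

$$\mathcal{E}\Big(\exp\Big(it \sum_j W_j Z_j\Big)\Big) = \mathcal{E}\Big(\prod_j g(tW_j)\Big) \qquad (6)$$

and it suffices to show that (6) converges to 1. In fact, because the integrand in (6) is
bounded, it suffices to show

$$\prod_j g(tW_j) \to 1 \quad \text{in pr.} \qquad (7)$$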

For $t = 0$, (7) is trivial; suppose $t \ne 0$. Because $\mathcal{E}(Z_i) = 0$, and all the $Z_i$ have the
same characteristic function g,

$$g(tW_j) = 1 + \theta_{nj}$$

where for any $\epsilon > 0$, there is a $\delta > 0$ such that

$$|\theta_{nj}| \le \epsilon\,|tW_j| \quad \text{whenever} \quad \max_j |W_j| < \delta. \qquad (8)$$

Also $n_x \to \infty$ in pr. implies that $\max_j |W_j| \to 0$ in pr. and hence

$$\max_j |\theta_{nj}| \to 0 \quad \text{in pr.} \qquad (9)$$

From (8) and (9), with $M \ge B\epsilon|t|$,

$$P\Big(\sum_j |\theta_{nj}| \le M\Big) \to 1 \qquad (10)$$

and finally (10) and (9) and (8) imply

$$\sum_j \theta_{nj} \to 0 \quad \text{in pr.} \qquad (11)$$

By the Corollary to Lemma 3.4.2, with $\theta = 0$, (7) follows from (9), (10) and (11). ■

Lemma 3.4.4 For $p \ge 1$ and i.i.d. $Y_i^x \sim F_x$, suppose that $|Y_i^x|^p$ satisfies one of the
moment conditions in Lemma 3.4.1(i-iv) and that the $W_i$ satisfy the corresponding condition
a.s. Suppose also that the $W_i$ are independent of the $|Y_i^x|^p$ and satisfy the further condition

$$\sum_i W_i \to 1 \quad \text{a.s.} \qquad (12)$$

Then

$$\sum_i W_i |Y_i^x|^p \to \mu_p(x) = \int |y|^p\,dF_x(y) \quad \text{a.s.}$$

PROOF. Let $D_i = |Y_i^x|^p - \int |y|^p\,dF_x(y)$. Then

$$\sum_i W_i |Y_i^x|^p - \mu_p(x) = \sum_i W_i D_i - \mu_p(x)\Big(1 - \sum_i W_i\Big). \qquad (13)$$

The second term in (13) converges to 0 a.s. by (12). Whichever moment condition
from Lemma 3.4.1 is satisfied by $|Y_i^x|^p$, it is also satisfied by $D_i$. The $D_i$ are i.i.d. with
mean zero. This also holds conditionally on the $W_i$, by independence.

Condition on $W_i = w_i$ that satisfy the requirements of Lemma 3.4.1. Then $\sum_i w_i D_i \to 0$
a.s. By Fubini's theorem, we can remove the conditioning and so $\sum_i W_i D_i \to 0$ a.s. ■

The bias term is $V_p(\hat F_x^x, \hat F_x) = (\sum W_i |Y_i - Y_i^x|^p)^{1/p}$. Conditionally on $X$, the mean of
$V_p^p$ is

$$\sum W_i V_p(F_{X_i}, F_x)^p$$

and the variance is bounded by

To control the bias term, conditions governing the behavior of $V_p(F_{X_i}, F_x)$ as a function of
$X_i$ can be traded off against conditions governing how the weight measure $W_x$ converges
to $\delta_x$. If $\mathcal{X}$ is compact and $V_p(F_{x_i}, F_x)$ is a continuous function of $x_i$ then it is bounded.
The boundedness of $V_p(F_{x_i}, F_x)$ allows relatively weak conditions to be imposed on $W_x$.
At the other end of the spectrum, convergence of $W_x$ allows weak conditions to be
placed on $V_p(F_{x_i}, F_x)$.

Mack and Silverman (1982) assume a uniform (in $x$) bound on $\int |y|^2 \, dF_x(y)$, which
they describe as a mild condition. (They establish uniform convergence of the regression
over suitable bounded intervals.) This is weaker than the boundedness of $Y$ that Devroye
(1981) uses, which as they point out does not even allow the usual normal linear model.
Their condition does not allow $(X, Y)$ to be bivariate normal with nonzero correlation. A
uniform bound on $\int |y|^2 \, dF_x(y)$ implies a uniform bound on $V_2(F_{x_i}, F_x)$.

Lemma 3.4.6 places conditions on $V_p(F_{x_i}, F_x)$, such as

$$V_p(F_{x_i}, F_x) \le M_x\bigl(|x_i - x| + |x_i - x|^a\bigr)$$

for $a > 1$. The first term dominates for $x_i$ near $x$, where most of the observations are
asymptotically, and the second regulates the long range behavior of the model $F_\cdot$. Recall
(Sec. 2.5) that for a location-scale family

$$F_x^{-1}(u) = \mu(x) + \sigma(x) F^{-1}(u), \quad u \in (0, 1)$$

the following bound holds:

$$V_p(F_{x_i}, F_x) \le |\mu(x_i) - \mu(x)| + |\sigma(x_i) - \sigma(x)| \Bigl(\int |F^{-1}(u)|^p \, du\Bigr)^{1/p}.$$
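The location-scale bound can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $V_p$ to be the $L^p$ distance between quantile functions (the form suggested by the location-scale identity above), uses a normal family, and approximates the integrals by an average over a midpoint grid in $u$. The function names and the grid size are hypothetical choices, not part of the report.

```python
from statistics import NormalDist

def quantile_grid(m):
    """Standard normal quantiles at the midpoints (j + 0.5) / m, j = 0..m-1."""
    z = NormalDist()
    return [z.inv_cdf((j + 0.5) / m) for j in range(m)]

def vp_normal(mu1, s1, mu2, s2, p, grid):
    """Approximate V_p between N(mu1, s1^2) and N(mu2, s2^2), assuming V_p is
    the L^p distance between the two quantile functions."""
    total = sum(abs((mu1 - mu2) + (s1 - s2) * q) ** p for q in grid)
    return (total / len(grid)) ** (1.0 / p)

def location_scale_bound(mu1, s1, mu2, s2, p, grid):
    """Right-hand side: |mu1 - mu2| + |s1 - s2| * (int |F^{-1}(u)|^p du)^(1/p)."""
    norm_p = (sum(abs(q) ** p for q in grid) / len(grid)) ** (1.0 / p)
    return abs(mu1 - mu2) + abs(s1 - s2) * norm_p

grid = quantile_grid(5000)
for p in (1, 2, 3):
    lhs = vp_normal(0.0, 1.0, 1.0, 2.0, p, grid)
    rhs = location_scale_bound(0.0, 1.0, 1.0, 2.0, p, grid)
    print(p, round(lhs, 4), round(rhs, 4))
```

On the grid the inequality is an instance of Minkowski's inequality, so it holds for every $p \ge 1$ up to rounding error.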

It follows that in a location-scale family conditions on the conditional location and
scale imply similar conditions on $V_p$. A range of conditions relating $V_p(F_{x'}, F_x)$ to $\|x' - x\|$
is considered, and the weaker the $V_p(F_{x'}, F_x)$ condition is, the stronger the condition
imposed on $W_x$ must be.

The  next  lemma  is  used  in  the  proof  of  Lemma  3.4.6(ii)  and  is  used  several  times  in 
Chapter  4. 

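Lemma 3.4.5 Let $X$ and $Y$ be random variables with $E(|Y| \mid X) < \infty$ a.s. and let
$\epsilon > 0$. Then

$$P(|Y| > \epsilon) \le \epsilon + P\bigl(E(|Y| \mid X) > \epsilon^2\bigr).$$

PROOF.

$$
\begin{aligned}
P(|Y| > \epsilon) &= E\bigl(P(|Y| > \epsilon \mid X)\bigr) \\
&\le E\bigl(\epsilon\, 1_{P(|Y| > \epsilon \mid X) \le \epsilon} + 1_{P(|Y| > \epsilon \mid X) > \epsilon}\bigr) \\
&\le \epsilon + P\bigl(P(|Y| > \epsilon \mid X) > \epsilon\bigr) \\
&\le \epsilon + P\Bigl(\tfrac{1}{\epsilon} E(|Y| \mid X) > \epsilon\Bigr). \quad ■
\end{aligned}
$$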
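Lemma 3.4.5 can be verified exactly on a small discrete joint distribution. The following is a minimal sketch; the function name and the toy distribution are hypothetical and serve only to show both sides of the inequality being computed.

```python
def lemma_345_sides(joint, eps):
    """Return (P(|Y| > eps), eps + P(E(|Y| | X) > eps^2)) for a discrete
    joint distribution given as a dict {(x, y): probability}."""
    lhs = sum(p for (x, y), p in joint.items() if abs(y) > eps)
    px, ey = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p              # marginal P(X = x)
        ey[x] = ey.get(x, 0.0) + p * abs(y)     # E(|Y| 1{X = x})
    # ey[x] / px[x] is the conditional mean E(|Y| | X = x)
    rhs = eps + sum(px[x] for x in px if ey[x] / px[x] > eps ** 2)
    return lhs, rhs

# Toy example: Y is usually 0 when X = 0 and usually 2 when X = 1.
joint = {(0, 0.0): 0.45, (0, 1.0): 0.05, (1, 0.0): 0.10, (1, 2.0): 0.40}
for eps in (0.2, 0.5, 0.9):
    print(eps, lemma_345_sides(joint, eps))
```

The bound is informative only when $\epsilon$ is small and the conditional mean of $|Y|$ is small with high probability, which is how it is used in the proofs that follow.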

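Lemma 3.4.6 Let $W_x$ be probability weights. Assume that $F_\cdot$ is $V_p$ continuous at $x$ and
that $W_x \to \delta_x$ Proh in pr. Then

$$V_p(\hat F_x^x, \hat F_x) \to 0 \ \text{in pr.}$$

if any of the conditions below hold:

(i) $V_p(F_{x_i}, F_x) \le B$

(ii) $V_p(F_{x_i}, F_x) \le M_x \max\{\|x - x_i\|, \|x - x_i\|^a\}$ and $W_x \to \delta_x$ $V_{ap}$ in pr.

(iii) $V_p(F_{x_i}, F_x) \le \phi(x_i - x)$ and $\sum W_i \phi(x_i - x)^p \to 0$ in pr.

(iv) $V_\infty(W_x, \delta_x) \to 0$ in pr.

where $B > 0$, $p > 1$, $a > 1$, $M_x$ and $\alpha \in (0, 1)$ are constants.

PROOF. For any $\epsilon > 0$ there is a radius $\delta > 0$ such that

$$\|x_i - x\| < \delta \implies V_p(F_{x_i}, F_x) < \epsilon.$$

Let $S_\delta = \{v \in \mathcal{X} : \|v - x\| < \delta\}$. Denote by $W_x(S_\delta)$ the sum of the weights corresponding
to $X_i \in S_\delta$. $E(W_x(S_\delta)) \to 1$ since $W_x(S_\delta)$ is uniformly bounded by 1 and converges to 1
in probability.

For case (i)

$$
\begin{aligned}
E\bigl(V_p(\hat F_x^x, \hat F_x)^p\bigr) &= E\Bigl(\sum W_i |Y_i^x - Y_i|^p\Bigr) \\
&= E\Bigl(\sum W_i V_p(F_{x_i}, F_x)^p\Bigr) \\
&\le E\bigl(W_x(S_\delta)\,\epsilon^p + W_x(S_\delta^c)\,B^p\bigr) \\
&\to \epsilon^p.
\end{aligned}
$$

Therefore $V_p(\hat F_x^x, \hat F_x) \to 0$ in $L^p$ and hence also in pr.

For case (ii)

$$
\begin{aligned}
P\bigl(V_p(\hat F_x^x, \hat F_x)^p > \epsilon\bigr) &\le \epsilon + P\bigl(E(V_p(\hat F_x^x, \hat F_x)^p \mid X) > \epsilon^2\bigr) \\
&= \epsilon + P\Bigl(\sum W_i E(|Y_i^x - Y_i|^p \mid X) > \epsilon^2\Bigr) \\
&= \epsilon + P\Bigl(\sum W_i V_p(F_{x_i}, F_x)^p > \epsilon^2\Bigr) \\
&\le \epsilon + P\Bigl(M_x^p \sum W_i \bigl(\|x_i - x\|^p + \|x_i - x\|^{ap}\bigr) > \epsilon^2\Bigr) \\
&= \epsilon + P\Bigl(M_x^p \bigl(V_p(W_x, \delta_x)^p + V_{ap}(W_x, \delta_x)^{ap}\bigr) > \epsilon^2\Bigr) \\
&\to \epsilon.
\end{aligned}
$$

The proof of (iii) is essentially the same as the one for (ii).

For case (iv)

$$
\begin{aligned}
P\bigl(V_p(\hat F_x^x, \hat F_x)^p > \epsilon\bigr) &\le P\Bigl(\sum_{x_i \in S_\delta} W_i |Y_i^x - Y_i|^p > \epsilon/2\Bigr) + P\Bigl(\sum_{x_i \notin S_\delta} W_i |Y_i^x - Y_i|^p > \epsilon/2\Bigr) \\
&\le P\Bigl(\sum_{x_i \in S_\delta} W_i |Y_i^x - Y_i|^p > \epsilon/2\Bigr) + P\bigl(V_\infty(W_x, \delta_x) > \delta\bigr) \\
&\le \frac{2}{\epsilon} E\Bigl(\sum_{x_i \in S_\delta} W_i |Y_i^x - Y_i|^p\Bigr) + P\bigl(V_\infty(W_x, \delta_x) > \delta\bigr) \\
&\le \frac{2}{\epsilon}\,\epsilon^p + P\bigl(V_\infty(W_x, \delta_x) > \delta\bigr) \\
&\to 2\epsilon^{p-1}.
\end{aligned}
$$

Therefore $V_p(\hat F_x^x, \hat F_x) \to 0$ in pr. ■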
For strong convergence of the bias term there is the possibility of combining any of
the strong laws for non-identically distributed random variables with any of the tradeoffs
between regularity of $F_\cdot$ and convergence of $W_x$. Instead of producing a lemma with some
twelve parts, we select part (iii) of Lemma 3.4.1', and strong versions of parts (i) and (iv)
of Lemma 3.4.6.

Lemma  3.4.7  Let  Wx  be  probability  weights.  Assume  that  F,  is  Vp  continuous,  that 
Wx  — ►  Sx  Proh  a.s.,  nx  >  Sn1-0lA  a.s.  and  max|W,-|  <  Bn~a  a.s.. 

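Then

$$V_p(\hat F_x^x, \hat F_x) \to 0 \ \text{a.s.}$$

if either of the following hold:

(i) $V_p(F_{x_i}, F_x) \le B$

(ii) $F_\cdot$ is $V_{p\gamma}$ continuous at $x$ and $V_\infty(W_x, \delta_x) \to 0$ a.s.

where $B > 0$, $p > 1$, $\lambda > 0$, $\gamma > 2 + \lambda$ and $\alpha \in (0, 1)$ are constants.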

Remark  The  variable  7  is  introduced  to  simplify  the  exposition.  A  uniform  bound  on 
£  (  {Dkl^  )  implies  a  uniform  bound  on  £  (  (|J3fc|p)2+Alog+(|jDt|p)1+'7  )  for  any  7  >  0. 
The  latter  condition  is  the  one  used  in  Lemma  3.4.1*. 

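PROOF. For any $\epsilon > 0$ there is a radius $\delta > 0$ such that

$$\|x_i - x\| < \delta \implies V_p(F_{x_i}, F_x) < \epsilon. \tag{14}$$

When $F_\cdot$ is $V_{p\gamma}$ continuous at $x$ there is a radius $\delta$ such that

$$\|x_i - x\| < \delta \implies V_{p\gamma}(F_{x_i}, F_x) < \epsilon. \tag{15}$$

Let $S_\delta = \{v \in \mathcal{X} : \|v - x\| < \delta\}$. Denote by $W_x(S_\delta)$ the sum of the weights corresponding
to $X_i \in S_\delta$. $W_x(S_\delta) \to 1$ a.s.

In either case condition on $X$ values that satisfy the a.s. conditions on the $W_i$. Strong
conditional convergence is sufficient by Fubini's theorem.

For case (i), pick $\delta > 0$ to satisfy (14). Then

$$
\begin{aligned}
V_p^p(\hat F_x^x, \hat F_x) = {} & \sum W_i \bigl(|Y_i^x - Y_i|^p - V_p^p(F_{x_i}, F_x)\bigr) \\
& + \sum_{x_i \in S_\delta} W_i V_p^p(F_{x_i}, F_x) \\
& + \sum_{x_i \notin S_\delta} W_i V_p^p(F_{x_i}, F_x)
\end{aligned}
$$

the first term of which converges to zero a.s. by Lemma 3.4.1'(iii). The second term is
bounded by $\epsilon^p$ and the third by $B^p W_x(S_\delta^c) \to 0$.

In (ii) pick $\delta$ to satisfy (15). Then

$$
\begin{aligned}
V_p^p(\hat F_x^x, \hat F_x) = {} & \sum_{x_i \in S_\delta} W_i \bigl(|Y_i^x - Y_i|^p - V_p^p(F_{x_i}, F_x)\bigr) \\
& + \sum_{x_i \in S_\delta} W_i V_p^p(F_{x_i}, F_x) \\
& + \sum_{x_i \notin S_\delta} W_i |Y_i^x - Y_i|^p
\end{aligned}
$$

the last term of which is eventually zero with probability 1. The second term is bounded
by $\epsilon^p$, and the first term satisfies the conditions of Lemma 3.4.1'(iii). ■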

The  results  for  weak  convergence  may  be  summarized  as  follows: 

Theorem 3.4.1 Suppose for some finite $p > 1$, that $F_\cdot$ is $V_p$ continuous at $x$ and that $W_x$
is obtained from probability weights with

$$W_x \to \delta_x \ \text{Proh in pr.} \quad \text{and} \quad n_x \to \infty \ \text{in pr.}$$

Then

$$\hat F_x \to F_x \ V_p \ \text{in pr.}$$

under any of the conditions below:

(i) $V_p(F_{x_i}, F_x) \le B$

(ii) $V_p(F_{x_i}, F_x) \le M_x \max\{\|x - x_i\|, \|x - x_i\|^a\}$ and $W_x \to \delta_x$ $V_{ap}$ in pr.

(iii) $V_p(F_{x_i}, F_x) \le \phi(x_i - x)$ and $\sum W_i \phi(x_i - x)^p \to 0$ in pr.

(iv) $V_\infty(W_x, \delta_x) \to 0$ in pr.

where $B > 0$, $M_x > 0$ and $a > 1$ are constants and $\phi$ is a nonnegative real function.

PROOF. By the triangle inequality

$$V_p(\hat F_x, F_x) \le V_p(\hat F_x, \hat F_x^x) + V_p(\hat F_x^x, F_x) \tag{16}$$

where $\hat F_x^x$ is defined by (1). By Lemma 3.4.6, $V_p(\hat F_x^x, \hat F_x) \to 0$ in pr. and from $n_x \to \infty$
follows $\mathrm{Proh}(\hat F_x^x, F_x) \to 0$ in pr. (Apply Theorem 3.2.2 with every $x_i = x$.) Also by
Lemma 3.4.3, $\sum W_i |Y_i^x|^p \to \mu_p(x)$ in pr. Therefore $V_p(\hat F_x^x, F_x) \to 0$ in pr. ■

The  results  for  strong  convergence  may  be  summarized  as  follows: 

Theorem 3.4.2 Suppose for some finite $p > 1$, that $F_\cdot$ is $V_p$ continuous at $x$, that $F_x$ has
a finite $(2 + \lambda)$'th absolute moment for some $\lambda > 0$ and that $W_x$ is obtained from probability
weights with


Wx  — ►  6X  Proh  a.s.  and  n„  >  Bn1  aX  a.s.|Wt- 1  <  Bn  a  a.s. 


for $\alpha \in (0, 1)$. Then

$$\hat F_x \to F_x \ V_p \ \text{a.s.}$$

under either of the conditions below:

(i) $V_p(F_{x_i}, F_x) \le B$

(ii) $F_\cdot$ is $V_{p\gamma}$ continuous at $x$ and $V_\infty(W_x, \delta_x) \to 0$ a.s.

where $\gamma > 2 + \lambda$.

PROOF. Decompose $V_p(\hat F_x, F_x)$ into bias and variance components as in Theorem
3.4.1. The bias term $V_p(\hat F_x^x, \hat F_x) \to 0$ a.s. by Lemma 3.4.7. By Lemma 3.4.4,
$\sum W_i |Y_i^x|^p \to \mu_p(x)$ a.s. using condition (iii) of Lemma 3.4.1 and the independence of the $Y_i^x$ and the $W_i$.
Also $\mathrm{Proh}(\hat F_x^x, F_x) \to 0$ a.s. by Theorem 3.2.2, so that the variance term $V_p(\hat F_x^x, F_x) \to 0$ a.s. ■


4  Asymptotic  Normality 


4.1  Introduction 

In Chapter 3, weak and strong consistency of running functionals was obtained. In this
chapter, many running functionals turn out to be asymptotically normal. In Chapter 3 the
estimate $\hat F_x$ was shown to converge to $F_x$ weakly or strongly (depending on the strength of the
conditions) in several metrics. In this chapter conditions are given under
which the normalized difference $\sqrt{n_x}(\hat F_x - F_x)$ converges weakly to a Brownian bridge.
Unifying features of the two chapters are that the same bias-variance split is used and the
effective sample size $n_x$ plays a role analogous to that played by $n$ in the i.i.d. setup. The
result is to refine the notion that estimation at $x$ is like that based on a biased sample of
size $n_x$ from $F_x$.

The development is as follows: The estimated regression function is split into bias
and variance terms. Sec. 4.2 develops necessary and sufficient conditions for the variance
term to have a normal limit. A multivariate central limit theorem follows immediately by
the Cramér-Wold device. Sec. 4.3 provides conditions under which the bias term goes to
zero fast enough that the regression itself is asymptotically normal. The variance term of
$\hat F_x$ converges weakly to a Brownian bridge under conditions given in Sec. 4.4, and under
further conditions the bias term converges to zero. Von Mises' method and the theory of
compact differentiability prove asymptotic normality for a class of running functionals in
Sec. 4.5.

The bias-variance split for $\hat F_x$ is

$$\hat F_x - F_x = (\hat F_x^x - F_x) + (\hat F_x - \hat F_x^x) \tag{1}$$

where $\hat F_x^x$ is obtained by substituting $Y_i^x$ for $Y_i$ in $\hat F_x$, and the split for a functional $T(\cdot)$ is

$$T(\hat F_x) - T(F_x) = \bigl(T(\hat F_x^x) - T(F_x)\bigr) + \bigl(T(\hat F_x) - T(\hat F_x^x)\bigr) \tag{2}$$

which for the conditional expectation becomes

$$\hat m(x) - m(x) = \sum W_i (Y_i^x - m(x)) + \sum W_i (Y_i - Y_i^x). \tag{3}$$

The second term is named after the bias because it is nonzero due to the discrepancy
between $F_{X_i}$ and $F_x$, and the first term is named after the variance because it is nonzero
due to sampling variation from $F_x$.
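The split of the regression estimate into variance and bias terms can be seen on simulated data. This is a rough sketch under assumptions that are not in the report: a hypothetical regression curve `m`, additive Gaussian noise (so that $Y_i^x$ is obtained by moving the same noise draw to the target point), and $k$ nearest neighbor probability weights.

```python
import random

def knn_weights(xs, x0, k):
    """Probability weights: 1/k on the k design points nearest x0, 0 elsewhere."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    w = [0.0] * len(xs)
    for i in order[:k]:
        w[i] = 1.0 / k
    return w

def split_terms(w, ys, ys_at_x0, m_x0):
    """Variance term sum W_i (Y_i^x - m(x)) and bias term sum W_i (Y_i - Y_i^x)."""
    variance = sum(wi * (yx - m_x0) for wi, yx in zip(w, ys_at_x0))
    bias = sum(wi * (y - yx) for wi, y, yx in zip(w, ys, ys_at_x0))
    return variance, bias

def m(x):
    """Hypothetical regression curve, used only for this illustration."""
    return x * x

rng = random.Random(1)
n, k, x0 = 500, 25, 0.5
xs = [rng.random() for _ in range(n)]
noise = [rng.gauss(0.0, 0.1) for _ in range(n)]
ys = [m(x) + e for x, e in zip(xs, noise)]    # Y_i drawn from F_{X_i}
ys_at_x0 = [m(x0) + e for e in noise]         # coupled Y_i^x drawn from F_x
w = knn_weights(xs, x0, k)
m_hat = sum(wi * y for wi, y in zip(w, ys))
variance, bias = split_terms(w, ys, ys_at_x0, m(x0))
print(m_hat - m(x0), variance, bias)
```

Because the weights sum to one, the error of the estimate equals the sum of the two terms exactly; the bias term depends only on how far the neighbors are from the target point.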

4.2  Asymptotic  Normality  of  the  Regression  Variance 


The variance term in 4.1.3 is a weighted sum of centered $Y_i^x$'s. The quantities $Y_i^x - m(x)$ are
i.i.d. with mean 0 and, we will assume, a finite variance. There is no essential difference in
the treatment of $Y_i^x$ and $h(Y_i^x)$ provided $h(Y_i^x)$ satisfies the moment conditions. Therefore
it will make the notation clearer to replace $Y_i^x$ or $h(Y_i^x)$ by $V_i$ where the $V_i$ are i.i.d. and
have first and second moments. By construction (Sec. 2.1) the $Y_i^x$'s are independent of
the $X_i$'s and hence of the $W_i$'s.

and

$$\max_{1 \le i \le n} \sqrt{n_x}\,|W_{ni}| \to 0. \tag{2}$$


PROOF. Necessity of (1) and of (2) is trivial. For sufficiency there is no loss of
generality in taking $\mu = 0$ and $\sigma^2 = 1$. To conform with our usual notation, abbreviate
$W_{ni}$ to $W_i$.

The proof begins by applying the Lindeberg theorem (Billingsley 1979, Theorem 27.2)
to the double array with $n,i$ element $\sqrt{n_x}\,W_i V_i$. We need only establish Lindeberg's
condition, which here amounts to showing

$$\sum_{i=1}^n \int_{|\sqrt{n_x} W_i V_i| > \eta} n_x W_i^2 V_i^2 \, dF \to 0 \tag{3}$$

for any $\eta > 0$.

Put $\bar W = \max |W_i|$ in each row of the table. Then the sum in (3) does not exceed

$$\sum_{i=1}^n n_x W_i^2 \int_{|\sqrt{n_x} \bar W V_1| > \eta} V_1^2 \, dF = \int_{|\sqrt{n_x} \bar W V_1| > \eta} V_1^2 \, dF \le \int_{|V_1| > \eta \lfloor (n_x \bar W^2)^{-1} \rfloor^{1/2}} V_1^2 \, dF \tag{4}$$

where $\lfloor z \rfloor$ denotes the largest integer less than or equal to $z$.
The sequence in (4) tends to zero in pr. if

$$\int_{|V_1| > \eta \sqrt{n}} V_1^2 \, dF \to 0. \tag{5}$$

Note that (5) is the Lindeberg condition for $\sqrt{n}$ times the sample average of $n$ i.i.d. $V_i$,
which has a normal limit. Since $\bar W \sqrt{n_x} \to 0$,

$$\lim_{n \to \infty} \max_{1 \le i \le n} P\bigl(|\sqrt{n_x}\,W_i V_i| > \epsilon\bigr) = 0 \tag{6}$$

for any $\epsilon > 0$. Then (6) and Feller's theorem (Billingsley, 1979, Theorem 27.4) together
imply (5). ■
Lemma 4.2.1 can also be proved using characteristic functions. Because the $V_i$ are
i.i.d. the "little o" terms all come from the same Taylor approximation and so their sum
is easy to manage.

Corollary The condition $\sum W_{ni} = 1$ may be replaced by

$$\sqrt{n_x}\Bigl(1 - \sum_{i=1}^n W_{ni}\Bigr) \to 0 \tag{7}$$

in Lemma 4.2.1.

PROOF. Immediate.

Since $W_x$ is random, it is essential to extend the conditions of Lemma 4.2.1.

Lemma 4.2.2 Let $W_{ni}$, $1 \le i \le n < \infty$ be a triangular array of real random variables,
and set

$$n_x = n_x(n) = \Bigl(\sum_{i=1}^n W_{ni}^2\Bigr)^{-1}.$$

Let $V_i$ be i.i.d. from a distribution $F$ with mean $\mu$ and positive variance $\sigma^2 < \infty$. Also
assume that the $V_i$ are independent of the $W_{ni}$. Then

$$Z_n := \sqrt{n_x}\Bigl(\sum_{i=1}^n W_{ni} V_i - \mu\Bigr) \xrightarrow{d} N(0, \sigma^2)$$

if all of the following hold:

$$n_x \to \infty \ \text{in pr.} \tag{8a}$$

$$\max_{1 \le i \le n} \sqrt{n_x}\,|W_{ni}| \to 0 \ \text{in pr.} \tag{8b}$$

$$\sqrt{n_x}\Bigl(1 - \sum_{i=1}^n W_{ni}\Bigr) \to 0 \ \text{in pr.} \tag{8c}$$

Remark For probability weights (8a) implies (8bc).

PROOF. As before abbreviate $W_{ni}$ by $W_i$. Make the split

$$Z_n = \sqrt{n_x} \sum W_i (V_i - \mu) - \sqrt{n_x}\,\mu\Bigl(1 - \sum W_i\Bigr). \tag{9}$$

The second term in (9) tends to zero in pr. by (8c). We may assume that $\sum W_i > 0$ by
(8ac) and so dividing each $W_i$ by $\sum W_i$ yields weights that sum to 1 without changing the
first term in (9). Therefore we may assume that $\sum W_i = 1$.

Let $z \in \mathbb{R}$ and $\epsilon > 0$. If the $W_i$ were fixed, then by Lemma 4.2.1 there would exist
$\delta > 0$ such that $n_x > 1/\delta$ and $\sqrt{n_x} \max |W_i| < \delta$ imply that $|P(Z_n \le z) - \Phi(z)| < \epsilon$ where
$\Phi$ is the standard normal distribution function. But by independence of the $W_i$ and $V_i$,
the conditional distribution of $Z_n$ given values of the $W_i$'s is exactly what it would be for
fixed $W_i$'s taking those values. Therefore

$$
\begin{aligned}
|P(Z_n \le z) - \Phi(z)| &\le E\,\bigl|P(Z_n \le z \mid W_1, \dots, W_n) - \Phi(z)\bigr| \\
&\le \epsilon + P(n_x \le 1/\delta) + P\bigl(\sqrt{n_x} \max |W_i| \ge \delta\bigr) \\
&\to \epsilon.
\end{aligned}
$$

Therefore $P(Z_n \le z) \to \Phi(z)$. ■
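A small simulation illustrates the normalization by the effective sample size $n_x = (\sum W_i^2)^{-1}$. The sketch below is only indicative: it uses fixed, linearly decaying probability weights and unit exponential $V_i$ (so $\mu = \sigma^2 = 1$), both hypothetical choices made for illustration.

```python
import random

def effective_sample_size(w):
    """n_x = 1 / sum of squared weights."""
    return 1.0 / sum(wi * wi for wi in w)

def triangular_weights(k):
    """Probability weights proportional to k, k-1, ..., 1 (illustrative choice)."""
    raw = [k - i for i in range(k)]
    total = float(sum(raw))
    return [r / total for r in raw]

def simulate_zn(w, reps, rng, mu=1.0):
    """Draw reps copies of Z_n = sqrt(n_x) (sum W_i V_i - mu), V_i ~ Exp(1)."""
    root = effective_sample_size(w) ** 0.5
    out = []
    for _ in range(reps):
        s = sum(wi * rng.expovariate(1.0) for wi in w)
        out.append(root * (s - mu))
    return out

rng = random.Random(0)
w = triangular_weights(200)
z = simulate_zn(w, 2000, rng)
z_mean = sum(z) / len(z)
z_var = sum((v - z_mean) ** 2 for v in z) / (len(z) - 1)
print(effective_sample_size(w), z_mean, z_var)
```

For fixed probability weights the variance of $Z_n$ equals $n_x \sum W_i^2 \sigma^2 = \sigma^2$ exactly, so the sample variance should be close to 1 while the effective sample size is smaller than the number of nonzero weights.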

Lemma  4.2.2  extends  to  a  multivariate  central  limit  theorem  as  follows: 

Lemma 4.2.3 Let $V_i$ be i.i.d. random vectors of length $p$ with mean $\mu$ and variance-covariance
matrix $\Sigma$. Let $W_{ni}$ satisfy (8abc). Assume that the $V_i$ are independent of the
$W_i$. Then

$$Z_n := \sqrt{n_x}\Bigl(\sum W_i V_i - \mu\Bigr) \xrightarrow{d} N_p(0, \Sigma).$$

PROOF. Let $l$ be any fixed $p$-vector. The asymptotic distribution of $l'Z_n$ is normal
with limiting first two moments 0 and $l'\Sigma l$ by Lemma 4.2.2. Since this holds for any $l$ the
asymptotic distribution of $Z_n$ is multivariate normal with mean 0 and variance-covariance
$\Sigma$. (See Rao 1973, 2c.5iv). ■

4.3  Asymptotic  Negligibility  of  the  Regression  Bias 

In this section we provide conditions under which

$$\sqrt{n_x} \sum W_i (Y_i - Y_i^x) \to 0.$$

With the factor $\sqrt{n_x}$, the variance term converges to a normal distribution with mean 0
and variance $E(Y^x - m(x))^2$. To make the bias converge, we require $W_x$ to converge to $\delta_x$
in some sense. For $W_x \to \delta_x$ to imply that the bias disappears it is necessary to suppose
that when $x_i$ is close to $x$, $F_{x_i}$ is suitably close to $F_x$. Typically one assumes that the
regression curve admits so many continuous derivatives and applies a Taylor expansion.
Here, that condition is replaced by an assumption that

$$V_1(F_{x_i}, F_x) \le M_x \|x_i - x\|$$

at least for $x_i$ close enough to $x$. In the presence of Prohorov continuity of $F_\cdot$ the condition
above is weaker than the existence of a derivative of $m(x)$. To make the normalized bias
converge, it will be necessary to have $W_x$ converge to $\delta_x$ faster in some sense than $n_x$
is going to infinity. In practice, one usually tolerates some asymptotic bias, in order to
obtain a lower mean square error.

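Lemma 4.3.1 If $F_\cdot$ satisfies

$$V_1(F_{x_i}, F_x) \le M_x \|x_i - x\| \tag{1}$$

and

$$E\bigl(\sqrt{n_x}\,V_1(W_x, \delta_x)\bigr) \to 0 \tag{2}$$

then

$$E\Bigl(\Bigl|\sqrt{n_x} \sum W_i (Y_i - Y_i^x)\Bigr|\Bigr) \to 0.$$

PROOF.

$$
\begin{aligned}
E\Bigl(\Bigl|\sqrt{n_x} \sum W_i (Y_i - Y_i^x)\Bigr|\Bigr) &\le E\Bigl(\sqrt{n_x} \sum |W_i|\,|Y_i - Y_i^x|\Bigr) \\
&= E\Bigl(\sqrt{n_x} \sum |W_i|\,V_1(F_{x_i}, F_x)\Bigr) \\
&\le E\Bigl(\sqrt{n_x} \sum |W_i|\,M_x \|x_i - x\|\Bigr) \\
&= E\bigl(\sqrt{n_x}\,M_x V_1(W_x, \delta_x)\bigr) \\
&\to 0. \quad ■
\end{aligned}
$$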

Condition (2) says that the weighted average absolute distance of the observations
used to estimate the regression from the target point must go to zero faster than the
reciprocal of the square root of the effective sample size. For k-NN, the k'th neighbor
should be at distance $o(1/k)$ from $x$. In the sampling case the k'th neighbor is usually at
distance $O_p(k/n)$ from the target point. Because condition (2) involves the expectation
of $V_1(W_x, \delta_x)$ it may be awkward when the $X_i$ are sampled from a long-tailed distribution.
For the next lemma (2) is weakened to convergence in pr., and the conclusion is
correspondingly weaker, but is enough to give the regression an asymptotically normal
distribution.
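The quantities entering the bias conditions are simple to compute for a given weight scheme. The sketch below assumes one-dimensional design points, takes $V_1(W_x, \delta_x)$ to be the weighted average absolute distance $\sum |W_i|\,|x_i - x|$ described above, and takes $V_\infty(W_x, \delta_x)$ to be the largest distance from $x$ among points with nonzero weight; the latter reading and all function names are assumptions made for illustration.

```python
def knn_weights(xs, x0, k):
    """Probability weights: 1/k on the k design points nearest x0, 0 elsewhere."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    w = [0.0] * len(xs)
    for i in order[:k]:
        w[i] = 1.0 / k
    return w

def effective_sample_size(w):
    """n_x = 1 / sum of squared weights."""
    return 1.0 / sum(wi * wi for wi in w)

def v1_to_point(xs, w, x0):
    """Weighted average absolute distance sum |W_i| |x_i - x0|."""
    return sum(abs(wi) * abs(xi - x0) for xi, wi in zip(xs, w))

def vinf_to_point(xs, w, x0):
    """Largest distance from x0 among points carrying nonzero weight (assumed form)."""
    return max(abs(xi - x0) for xi, wi in zip(xs, w) if wi != 0.0)

xs = [0.1, 0.2, 0.4, 0.8]
w = knn_weights(xs, 0.0, 2)
n_x = effective_sample_size(w)
print(n_x, v1_to_point(xs, w, 0.0), vinf_to_point(xs, w, 0.0))
print(n_x ** 0.5 * v1_to_point(xs, w, 0.0))   # the normalized quantity in (2)
```

In this toy example the two nearest points carry weight 1/2 each, so $n_x = 2$, the weighted average distance is 0.15 and the largest weighted distance is 0.2.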

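Lemma 4.3.2 If $F_\cdot$ satisfies

$$V_1(F_{x_i}, F_x) \le M_x \|x_i - x\|$$

and if

$$\sqrt{n_x}\,V_1(W_x, \delta_x) \to 0 \ \text{in pr.} \tag{3}$$

then

$$\sqrt{n_x} \sum W_i (Y_i - Y_i^x) \to 0 \ \text{in pr.}$$

PROOF. Let $\epsilon > 0$, and put

$$B = \sqrt{n_x} \sum |W_i|\,|Y_i - Y_i^x|.$$

Then, using $X$ to denote the sequence of $X_i$'s, and recalling Lemma 3.4.5:

$$
\begin{aligned}
P(B > \epsilon) &\le \epsilon + P\bigl(E(B \mid X) > \epsilon^2\bigr) \\
&= \epsilon + P\Bigl(E\Bigl(\sqrt{n_x} \sum |W_i|\,|Y_i - Y_i^x| \Bigm| X\Bigr) > \epsilon^2\Bigr) \\
&= \epsilon + P\Bigl(\sqrt{n_x} \sum |W_i|\,E\bigl(|Y_i - Y_i^x| \bigm| X\bigr) > \epsilon^2\Bigr) \\
&= \epsilon + P\Bigl(\sqrt{n_x} \sum |W_i|\,V_1(F_{x_i}, F_x) > \epsilon^2\Bigr) \\
&\le \epsilon + P\Bigl(\sqrt{n_x} \sum |W_i|\,M_x \|x_i - x\| > \epsilon^2\Bigr) \\
&= \epsilon + P\bigl(\sqrt{n_x}\,M_x V_1(W_x, \delta_x) > \epsilon^2\bigr) \\
&\to \epsilon.
\end{aligned}
$$

Condition (2) is that the area between the distribution curves is locally Lipschitz.
This is a mild short range condition, but it does have long range consequences. Most
authors handle the long range problem by either working in a compact set, or by using
a $W_x$ that has $V_\infty$ convergence to $\delta_x$. (Examples are kernels with bounded support,
and k-NN schemes.) With $V_\infty$ convergence of $W_x$ it is only necessary to assume that
$V_1(F_{x_i}, F_x) \le M_x \|x_i - x\|$ for sufficiently small $\|x_i - x\|$. With compact $\mathcal{X}$ and Prohorov
continuous $F_\cdot$, continuity of $m(\cdot)$ implies condition (2).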

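Lemma 4.3.3 For some positive $D < \infty$, suppose $F_\cdot$ satisfies

$$V_1(F_{x_i}, F_x) \le M_x \|x_i - x\| \ \text{whenever} \ \|x_i - x\| \le D.$$

Assume that $P(n_x \ge 1) \to 1$ and

$$\sqrt{n_x}\,V_\infty(W_x, \delta_x) \to 0 \ \text{in pr.,} \tag{3}$$

and for some positive $E < \infty$

$$P\Bigl(\sum |W_i| \le E\Bigr) \to 1. \tag{4}$$

Then

$$\sqrt{n_x} \sum W_i (Y_i - Y_i^x) \to 0 \ \text{in pr.}$$

PROOF. Let $\epsilon > 0$ and define

$$H = \{n_x \ge 1\} \cap \Bigl\{\sum |W_i| \le E\Bigr\} \cap \{V_\infty(W_x, \delta_x) \le D\}$$

and

$$B = \sqrt{n_x} \sum |W_i|\,|Y_i - Y_i^x|.$$

Then

$$
\begin{aligned}
P(B > \epsilon) &\le P(B 1_H > \epsilon) + P(H^c) \\
&\le \epsilon + P\bigl(E(B 1_H \mid X) > \epsilon^2\bigr) + P(H^c) \\
&\le \epsilon + P\bigl(\sqrt{n_x}\,M_x E\,V_\infty(W_x, \delta_x) > \epsilon^2\bigr) + P(H^c) \\
&\to \epsilon + P(H^c) \\
&\le \epsilon + P\Bigl(\sum |W_i| > E\Bigr) + P\bigl(n_x < 1 \ \text{or} \ V_\infty(W_x, \delta_x) > D\bigr) \\
&\to \epsilon + P\bigl(n_x < 1 \ \text{or} \ V_\infty(W_x, \delta_x) > D\bigr) \\
&\le \epsilon + P(n_x < 1) + P\bigl(V_\infty(W_x, \delta_x) > D \ \text{and} \ n_x \ge 1\bigr) \\
&\to \epsilon + P\bigl(V_\infty(W_x, \delta_x) > D \ \text{and} \ n_x \ge 1\bigr) \\
&\le \epsilon + P\bigl(\sqrt{n_x}\,V_\infty(W_x, \delta_x) > D \ \text{and} \ n_x \ge 1\bigr) \\
&\le \epsilon + P\bigl(\sqrt{n_x}\,V_\infty(W_x, \delta_x) > D\bigr) \\
&\to \epsilon. \quad ■
\end{aligned}
$$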

Condition (4) is introduced because of the way $V_\infty(W_x, \delta_x)$ is extended to finite signed
measures $W_x$ in Subsec. 2.4.3. For bias elimination, very light conditions are placed on
the sequence $n_x$. For example 1-NN schemes, in which the closest neighbor to $x$ gets unit
weight and all other observations get 0 weight, satisfy the lemmas above. The condition
governing $\sum |W_i|$ is important in the bias considerations, but was not needed to handle
the variance term in Sec. 4.2.

Now,  combining  the  results  of  this  section  and  Sec.  4.2: 


Theorem 4.3.1 If for some positive $E < \infty$ $W_x$ satisfies:

$$n_x \to \infty \ \text{in pr.} \tag{5a}$$

$$\max_{1 \le i \le n} \sqrt{n_x}\,|W_i| \to 0 \ \text{in pr.} \tag{5b}$$

$$\sqrt{n_x}\Bigl(1 - \sum_{i=1}^n W_i\Bigr) \to 0 \ \text{in pr.} \tag{5c}$$

$$\sqrt{n_x}\,V_1(W_x, \delta_x) \to 0 \ \text{in pr.} \tag{5d}$$

$$P\Bigl(\sum |W_i| \le E\Bigr) \to 1 \tag{5e}$$

and for some positive $M_x < \infty$ $F_\cdot$ satisfies:

$$0 < \sigma_x^2 = E\bigl(Y^x - m(F_x)\bigr)^2 < \infty \tag{6a}$$

$$V_1(F_{x_i}, F_x) \le M_x \|x_i - x\| \tag{6b}$$

then

$$\sqrt{n_x}\bigl(m(\hat F_x) - m(F_x)\bigr) \xrightarrow{d} N(0, \sigma_x^2). \tag{7}$$

If (6b) only holds for $\|x_i - x\| \le D < \infty$ and (5d) is strengthened to

$$\sqrt{n_x}\,V_\infty(W_x, \delta_x) \to 0 \ \text{in pr.} \tag{8}$$

then (7) holds.

PROOF. Write

$$
\begin{aligned}
\sqrt{n_x}\bigl(m(\hat F_x) - m(F_x)\bigr) &= \sqrt{n_x}\Bigl(\sum W_i Y_i - m(F_x)\Bigr) \\
&= \sqrt{n_x}\Bigl(\sum W_i Y_i^x - m(F_x)\Bigr) + \sqrt{n_x} \sum W_i (Y_i - Y_i^x).
\end{aligned} \tag{9}
$$

The first term in (9) tends in distribution to $N(0, \sigma_x^2)$ by Lemma 4.2.2, because the $Y_i^x$ are
independent of the $W_i$, and because of (5abc). Under (5de) and (6ab) the second term in
(9) converges to 0 in pr. by Lemma 4.3.2. If (6b) only holds locally, but (8) holds, then
the second term in (9) converges to 0 in pr. by Lemma 4.3.3. (Note that (5a) implies
$P(n_x \ge 1) \to 1$.) ■

Schuster  (1972)  obtains  asymptotic  joint  normality  of  the  regression  function  at  a 
finite  number  of  points,  for  kernel  regressions.  The  regression  values  at  distinct  points 
are asymptotically independent. Royall (1966) obtains asymptotic normality for nearest
neighbor methods. Stute (1984) obtains asymptotic normality for symmetric nearest
neighbor  methods  with  a  bounded  kernel.  Where  Schuster  assumes  a  finite  third  moment 
for  Y,  Stute  needs  only  a  finite  second  moment. 

4.4 Asymptotic Distribution of $\sqrt{n_x}(\hat F_x - F_x)$

This section shows that the conditional empirical process $\hat F_x - F_x$ has a Brownian limit
when normalized by $\sqrt{n_x}$ under very general conditions on the weights.

Start by making the split

$$\hat F_x - F_x = (\hat F_x^x - F_x) + (\hat F_x - \hat F_x^x)$$

where $\hat F_x^x$ is obtained by replacing each $Y_i$ in $\hat F_x$ by $Y_i^x$. Recall that the $Y_i^x$ are i.i.d. from $F_x$.
The first term above is the variance term and the second is the bias.

The goal is to make the variance term normalized by $\sqrt{n_x}$ converge to a Gaussian
process and to make the normalized bias term converge weakly to zero. The Gaussian
process is supposed to be the Brownian bridge when $F_x$ is absolutely continuous. In the
absolutely continuous case it is sufficient to consider $F_x = U[0, 1]$. That is, assume

$$F_x(t) := P(Y \le t \mid X = x) = t.$$

The value of the variance process at $t$ is then

$$Z_n(t) = \sqrt{n_x} \sum W_i \bigl(1_{U_i \le t} - t\bigr)$$

where the $U_i$ are i.i.d. $U[0, 1]$. To accommodate conflicting conventions the sign of the variance
term has been reversed.

Let $0 < t_1 < \dots < t_k < 1$ for some finite $k$. The vector $V_n = (Z_n(t_1), \dots, Z_n(t_k))'$
has mean zero and for $i \le j$ the $i, j$ element of its variance-covariance matrix is $t_i(1 - t_j)$.
Thus it has the same first two moments as the Brownian bridge process.
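The covariance claim can be checked by simulation. The following sketch assumes the variance process has the form $Z_n(t) = \sqrt{n_x} \sum W_i (1_{U_i \le t} - t)$ with probability weights and $n_x = (\sum W_i^2)^{-1}$, and uses equal weights on 50 points; the helper names are hypothetical.

```python
import random

def z_n(w, us, t):
    """Variance process at t: sqrt(n_x) * sum W_i (1{U_i <= t} - t)."""
    n_x = 1.0 / sum(wi * wi for wi in w)
    return n_x ** 0.5 * sum(wi * ((1.0 if u <= t else 0.0) - t)
                            for wi, u in zip(w, us))

def sample_cov(w, s, t, reps, rng):
    """Monte Carlo estimate of Cov(Z_n(s), Z_n(t)) with i.i.d. uniform U_i."""
    zs, zt = [], []
    for _ in range(reps):
        us = [rng.random() for _ in w]
        zs.append(z_n(w, us, s))
        zt.append(z_n(w, us, t))
    ms, mt = sum(zs) / reps, sum(zt) / reps
    return sum((a - ms) * (b - mt) for a, b in zip(zs, zt)) / (reps - 1)

rng = random.Random(2)
w = [1.0 / 50] * 50
s, t = 0.3, 0.7
cov_hat = sample_cov(w, s, t, 4000, rng)
print(cov_hat, s * (1.0 - t))   # Brownian bridge covariance s(1 - t) for s <= t
```

The agreement does not depend on the weights being equal: for any fixed probability weights the covariance is $n_x \sum W_i^2 (s - st) = s(1 - t)$.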

Lemma 4.4.1 If $F_x$ is uniform[0,1] and $\hat F_x$ is obtained by weights satisfying (4.2.8abc) then
the finite dimensional distributions of $\sqrt{n_x}(\hat F_x^x - F_x)$ converge to those of the Brownian
bridge.

PROOF. Apply Lemma 4.2.3.

In  addition  to  the  convergence  of  the  finite  dimensional  distributions  to  those  of  the 
Brownian  bridge,  it  is  also  necessary  to  govern  the  behavior  of  the  process  over  small 
intervals.  This  is  usually  done  by  proving  uniform  tightness  of  the  sequence  of  processes. 
We will instead use a similar approach from Pollard (1984, Chapter V). Consider $Z_n$ as a
member of $D[0, 1]$, the space of real valued functions defined on $[0, 1]$ that are continuous
from the right and have limits from the left. Such functions are sometimes called cadlag
functions, from the French: continue à droite, limites à gauche. Equip the space $D[0, 1]$
with the uniform metric $d(Z, W) = \sup_{0 \le t \le 1} |Z(t) - W(t)|$ and the projection $\sigma$-field.
The projection $\sigma$-field differs from the usual Borel $\sigma$-field in that empirical distribution
functions are measurable. The former is generated by all closed balls, the latter by all
closed subsets. The trace of the projection $\sigma$-field on $C[0, 1]$, the space of continuous
functions on $[0, 1]$, coincides with the Borel $\sigma$-field of $C[0, 1]$. For a detailed discussion of
this approach see Pollard (1984). His Theorem V.3 is the main result. It is:

Theorem 4.4.1 Let $Z, Z_1, Z_2, \dots$ be random elements of $D[0, 1]$ under the uniform metric
and the projection $\sigma$-field. Suppose $P\{Z \in C\} = 1$ for some separable subset $C$ of $D[0, 1]$.
The necessary and sufficient conditions for $\{Z_n\}$ to converge in distribution to $Z$ are:

(i) the finite dimensional distributions of $Z_n$ converge to those of $Z$

(ii) to each $\epsilon > 0$ and $\delta > 0$ there corresponds a grid $0 = t_0 < t_1 < \dots < t_m = 1$ such
that

$$\limsup_{n \to \infty} P\Bigl\{\max_{0 \le i < m} \ \sup_{t \in [t_i, t_{i+1})} |Z_n(t) - Z_n(t_i)| > \delta\Bigr\} < \epsilon. \tag{1}$$

PROOF. Pollard (1984, pp. 92-3).

When $Z$ is the Brownian bridge $C$ can be taken to be $C[0, 1]$.

Definition A sequence $Z_n$ of random elements in $D[0, 1]$ under the uniform metric and
projection $\sigma$-field is nearly tight if condition (ii) of Theorem 4.4.1 holds. The property of
being nearly tight will be called near tightness.

A uniformly tight sequence is nearly tight. A nearly tight sequence need not be
uniformly tight. For example Pollard (1984) shows that $F_n$ is nearly tight, but $F_n$ is not
uniformly tight. See Fernholz (1983, p. 28) for a characterization of the bounded sets of
$D[0, 1]$ that have compact closure.

To establish weak convergence of the variance term to the Brownian bridge it only
remains to prove near tightness of $\sqrt{n_x}(\hat F_x^x - F_x)$.
In certain special cases the convergence to the Brownian bridge is very easy to show.
We can borrow from the i.i.d. case in which the Brownian limit is well known, in the
same way that Lemma 4.2.1 borrows from the i.i.d. central limit theorem via Feller's theorem.
Perhaps the simplest is the following, which includes k nearest neighbor smoothers,
symmetric k nearest neighbor smoothers and one sided nearest neighbor smoothers.

Theorem 4.4.2 If $F_x$ is uniform[0,1] and $\hat F_x$ is obtained by weights of which

$$k = k(n) \to \infty \ \text{in pr.}$$

are $1/k$ and the rest are 0, then

$$Z_n := \sqrt{n_x}(\hat F_x^x - F_x) \xrightarrow{d} B$$

where $B$ is the Brownian bridge.

PROOF. We have $\sum W_i = 1$, $n_x = k$ and $\sqrt{n_x} \max |W_i| = 1/\sqrt{k}$ so by Lemma 4.4.1
the finite dimensional distributions of $Z_n$ converge to those of the Brownian bridge.

Let $\epsilon > 0$ and $\delta > 0$. It is well known that the desired weak convergence holds when
$k = n$. Let $\tilde Z_n$ be the sequence of processes obtained by taking $W_i = 1/n$ in the expression
for $Z_n$ and by taking $\sqrt{n}$ for $\sqrt{n_x}$. By the necessity part of Theorem 4.4.1 there is a grid
$0 = t_0 < t_1 < \dots < t_m = 1$ such that (1) holds for $\tilde Z_n$.

Now

$$
\begin{aligned}
\limsup_{n \to \infty} P\Bigl\{\max_{i} \sup_{t \in [t_i, t_{i+1})} |Z_n(t) - Z_n(t_i)| > \delta\Bigr\}
&= \limsup_{n \to \infty} P\Bigl\{\max_{i} \sup_{t \in [t_i, t_{i+1})} |\tilde Z_{k_n}(t) - \tilde Z_{k_n}(t_i)| > \delta\Bigr\} \\
&\le \limsup_{n \to \infty} P\Bigl\{\max_{i} \sup_{t \in [t_i, t_{i+1})} |\tilde Z_n(t) - \tilde Z_n(t_i)| > \delta\Bigr\} \\
&< \epsilon.
\end{aligned}
$$

The equality above is due to $Z_n$ having the same distribution as $\tilde Z_{k_n}$, the first inequality
follows because the limit supremum of a sequence is no less than that of any subsequence,
and the last inequality is by construction. ■
The approach above generalizes to some schemes in which a finite number of weight
levels are used. For example, suppose weight $2/(3k)$ is put on each of the $k$ nearest
neighbors and weight $1/(3k)$ is put on each of the next $k$ closest neighbors. The process
$Z_n$ is then a sum of two processes, one for the nearest neighbors and another for the second
group. Each term in the sum converges to a constant times a Brownian bridge. For each
$n$ the terms are independent. It follows that their sum converges to a constant times a
Brownian bridge and the normalization $\sqrt{n_x}$ is such that the standard Brownian bridge
would be the result. Proceeding from 2 to $L$ levels is straightforward and many interesting
weight schemes can be approximated this way for large $L$.
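For the two-level example the normalization can be worked out directly. The sketch below assumes $n_x = (\sum W_i^2)^{-1}$ and computes the share of the limiting variance contributed by each group of neighbors; the function names are hypothetical.

```python
def two_level_weights(k):
    """Weight 2/(3k) on the k nearest neighbors, 1/(3k) on the next k."""
    return [2.0 / (3 * k)] * k + [1.0 / (3 * k)] * k

def effective_sample_size(w):
    """n_x = 1 / sum of squared weights."""
    return 1.0 / sum(wi * wi for wi in w)

k = 10
w = two_level_weights(k)
n_x = effective_sample_size(w)
# Variance share of each group after normalizing by sqrt(n_x).
share_near = n_x * sum(wi * wi for wi in w[:k])
share_far = n_x * sum(wi * wi for wi in w[k:])
print(sum(w), n_x, share_near, share_far)
```

Here $n_x = 9k/5$, and the two groups contribute variance shares of 4/5 and 1/5, which sum to one as required for a standard Brownian bridge limit.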

A large class of weighting schemes might be shown to have variance terms which converge
to the Brownian bridge by arguments based on approximating this way and showing
that the differences between the approximate and actual processes are asymptotically negligible.
Instead, an argument that parallels the development of the functional central limit
theorem in Pollard (1984, Chapter V) is given below.

The key step in the derivation is to bound the probability that the supremum of $|Z_n(t)|$
over a short interval exceeds a constant $\delta$ by a probability based only on the difference
between the endpoints of the interval. This is accomplished by the following lemma, which
Pollard gives as Lemma V.7:

Lemma 4.4.2 Let $\{Z(t) : 0 \le t \le b\}$ be a process with cadlag sample paths taking the
value zero at $t = 0$. Suppose $Z(t)$ is $\mathcal{E}_t$-measurable, for some increasing family of $\sigma$-fields
$\{\mathcal{E}_t : 0 \le t \le b\}$. If at each point of $\{|Z(t)| > \delta\}$,

$$P\left( |Z(b) - Z(t)| \le \tfrac{1}{2}|Z(t)| \;\middle|\; \mathcal{E}_t \right) \ge \beta,$$

where $\beta$ is a positive number depending only on $\delta$, then

$$P\left\{ \sup_{0 \le t \le b} |Z(t)| > \delta \right\} \le \beta^{-1} P\{ |Z(b)| > \delta/2 \}.$$

PROOF. Pollard (1984, pp. 94-5).
Theorem 4.4.3 If $F_x$ is uniform[0,1] and $\hat F_x$ is obtained from $W_i$ satisfying (4.2.8abc)
then

$$Z_n \Rightarrow B_0$$

where $B_0$ is the Brownian bridge process.

PROOF. By Lemma 4.4.1 the finite dimensional distributions of $Z_n(t)$ converge to
those of $B_0$, and so by Theorem 4.4.1 it only remains to establish (1).

It suffices to show near tightness of $Z_n$ for fixed $W_i$ satisfying

$$n_x \to \infty, \quad \sum_i W_i = 1 \quad \text{and} \quad \sqrt{n_x}\,\max_i |W_i| \to 0$$

because then (1) follows for random $W_i$ satisfying (4.2.8abc) by the technique of Lemma
4.2.2.

Let $\epsilon > 0$ and $\delta > 0$. With the $W_i$ fixed,

$$W(t) \stackrel{\mathrm{def}}{=} \sum_i W_i 1_{U_i \le t}$$

is a nondecreasing process with cadlag sample paths. $W(0) = 0$ and $W(1) = 1$, and $W(\cdot)$
jumps by the fixed amount $W_i$ at the random place $U_i$. Since the $U_i$ are independent
uniform[0,1], the process $W(t + s) - W(s)$ on $0 \le t \le b \le 1 - s$ has the same distribution
as $W(t)$ on $0 \le t \le b$. Note that $Z_n(t) = \sqrt{n_x}\,(W(t) - t)$, so it also has this stationarity
property.

When $t_i = i/m$, (1) reduces to

$$\begin{aligned}
\limsup P\Big\{ \max_{0 \le i < m}\ \sup_{t \in [i/m,\,(i+1)/m)} |Z_n(t) - Z_n(t_i)| > \delta \Big\}
&\le \limsup \sum_{i=0}^{m-1} P\Big\{ \sup_{t \in [i/m,\,(i+1)/m)} |Z_n(t) - Z_n(t_i)| > \delta \Big\} \\
&= \limsup\ m\,P\Big\{ \sup_{t \in [0,\,1/m)} |Z_n(t) - Z_n(0)| > \delta \Big\}
\end{aligned}$$

using the stationarity in the last step. The last probability will be replaced by one involving
only $Z_n(1/m)$. For notational convenience put $b = 1/m$.
Let $\mathcal{E}_t$ be the $\sigma$-field generated by $Z_n(s)$ on $0 \le s \le t$. It is determined by those $U_i$
which fall in the same interval. Conditionally on $\mathcal{E}_t$,

$$D = W_n(b) - W_n(t)$$

is a sum of a random number of $W_i$ which are themselves randomly selected without
replacement from the $W_i$ corresponding to $U_i > t$. Given $\mathcal{E}_t$, the number of such $W_i$ has a
binomial distribution with parameters $n_t$ and $p_t = (b - t)/(1 - t)$, where $n_t$ is the number
of $U_i > t$.

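Lemma 4.4.3 below establishes the bound

$$E\left( (D - (b - t))^2 \mid \mathcal{E}_t \right) \le p_t n_x^{-1} + p_t^2\,(t - W(t))^2$$

under the assumption that $n_t \ge 2$. To assume that $n_t \ge 2$ is no loss of generality since
$n_t \ge n_b$ and $n_b/n \to 1 - b$ a.s. On the set where $|Z_n(t)| > \delta$,

$$\begin{aligned}
P\left( |Z_n(b) - Z_n(t)| > \tfrac{1}{2}|Z_n(t)| \;\middle|\; \mathcal{E}_t \right)
&= P\left( |D - (b - t)| > \tfrac{1}{2}|W_n(t) - t| \;\middle|\; \mathcal{E}_t \right) \\
&\le 4\,E\left( (D - (b - t))^2 \mid \mathcal{E}_t \right) / (W_n(t) - t)^2 \\
&\le 4\left( p_t n_x^{-1} + p_t^2 (W_n(t) - t)^2 \right) / (W_n(t) - t)^2 \\
&= 4\,(p_t n_x^{-1}) / (W_n(t) - t)^2 + 4 p_t^2 \\
&\le 4\,(p_t n_x^{-1}) / (\delta^2 n_x^{-1}) + 4 p_t^2 \\
&\le 4 p_t / \delta^2 + 4 p_t^2 \\
&\le 1/2
\end{aligned}$$

for small enough $b$, that is for large enough $m$. (It is easy to see that $p_t \le b(1 - b)^{-1}$.)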
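Using Lemma 4.4.2 with $\beta = 1/2$, and the convergence of $Z_n(b)$ to $N(0, b(1 - b))$,

$$\begin{aligned}
\limsup\ m\,P\Big\{ \sup_{t \in [0,b]} |Z_n(t)| > \delta \Big\}
&\le \limsup\ 2m\,P\{ |Z_n(b)| > \delta/2 \} \\
&= 2m\,P\{ |N(0, b - b^2)| > \delta/2 \} \\
&\le 2m\,P\{ |N(0, b)| > \delta/2 \} \\
&= 2m\,P\{ |N(0, 1)| > \delta/(2\sqrt{b}) \} \\
&= 2m\,P\{ N(0, 1)^4 > \delta^4/(16 b^2) \} \\
&\le 32\,m\,b^2\,E\left( N(0,1)^4 \right) \delta^{-4} \\
&\le 32\,m^{-1}\,E\left( N(0,1)^4 \right) \delta^{-4} \\
&< \epsilon
\end{aligned}$$

for large enough $m$. ■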
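The last steps of this chain can be made concrete. Using the standard fact $E(N(0,1)^4) = 3$, the final bound is $96/(m\delta^4)$, so any $m > 96/(\epsilon\delta^4)$ suffices. The sketch below (function names are illustrative, not from the text) compares the Gaussian tail term with the fourth-moment bound and computes the smallest such $m$.

```python
import math

def normal_tail_term(m, delta):
    """2m * P(|N(0,1)| > delta / (2*sqrt(b))) with b = 1/m."""
    b = 1.0 / m
    x = delta / (2.0 * math.sqrt(b))
    # P(|N(0,1)| > x) = erfc(x / sqrt(2))
    return 2.0 * m * math.erfc(x / math.sqrt(2.0))

def fourth_moment_bound(m, delta):
    """32 * m^{-1} * E[N(0,1)^4] * delta^{-4}, with E[N(0,1)^4] = 3."""
    return 32.0 * 3.0 / (m * delta ** 4)

def smallest_m(eps, delta):
    """Smallest integer m for which fourth_moment_bound(m, delta) < eps."""
    return math.floor(96.0 / (eps * delta ** 4)) + 1

for m in (1, 10, 100, 1000):
    print(m, normal_tail_term(m, 0.5), fourth_moment_bound(m, 0.5))
```

The Markov-type bound is crude: the exact Gaussian tail term decays much faster in $m$, but the polynomial bound is all the proof needs.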

Lemma 4.4.3 Let $D = \sum_{i=1}^{R} W_i$ where $R$ has a binomial distribution with parameters
$n_t \ge 2$ and $p_t = (b - t)/(1 - t) \stackrel{\mathrm{def}}{=} 1 - q_t$, and given $R = r$, the $W_i$ are sampled without
replacement from $n_t$ real numbers that sum to $1 - W(t)$ and whose squares sum to less
than $n_x^{-1}$. Then

$$E\left( (D - (b - t))^2 \right) \le p_t n_x^{-1} + p_t^2\,(t - W(t))^2.$$

PROOF. Suppose $W_1$ and $W_2$ are selected by sampling without replacement from
the $n_t$ numbers. Then $E(W_1) = (1 - W(t))/n_t$ and $E(W_1^2) \le n_x^{-1}/n_t$ and

$$E(W_1 W_2) \le (1 - W(t))^2 / ((n_t - 1)\,n_t).$$

Using the above and $E(Q) = E(E(Q \mid R))$ for various $Q$,

$$\begin{aligned}
E\left( (D - (b - t))^2 \right)
&= E(D^2) - 2(b - t)\,E(D) + (b - t)^2 \\
&= E(R)\,E(W_1^2) + E(R^2 - R)\,E(W_1 W_2) - 2(b - t)\,E(R)\,E(W_1) + (b - t)^2 \\
&\le \frac{n_t p_t n_x^{-1}}{n_t} + \frac{n_t^2 p_t^2 - n_t p_t^2}{n_t^2 - n_t}\,(1 - W(t))^2 - 2(b - t)\,p_t\,(1 - W(t)) + (b - t)^2 \\
&= p_t n_x^{-1} + p_t^2\,(1 - W(t))^2 - 2(b - t)\,p_t\,(1 - W(t)) + (b - t)^2 \\
&= p_t n_x^{-1} + \left( p_t (1 - W(t)) - (b - t) \right)^2 \\
&= p_t n_x^{-1} + (b - t)^2 \left( (1 - W(t))/(1 - t) - 1 \right)^2 \\
&= p_t n_x^{-1} + (b - t)^2 \left( (t - W(t))/(1 - t) \right)^2 \\
&= p_t n_x^{-1} + p_t^2\,(t - W(t))^2. \qquad ■
\end{aligned}$$
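The bound of Lemma 4.4.3 can be verified exactly on a small example. The sketch below enumerates every subset of a short list of numbers; a binomial count $R$ followed by sampling $R$ items without replacement gives each particular subset of size $r$ probability $p_t^r (1-p_t)^{n_t - r}$. The numbers, the value of $n_x^{-1}$, and the function names are hypothetical choices for illustration.

```python
from itertools import combinations

def exact_second_moment(values, p, center):
    """Exact E[(D - center)^2], where D sums a uniformly chosen subset of
    `values` whose size R is Binomial(len(values), p)."""
    n = len(values)
    total = 0.0
    for r in range(n + 1):
        # probability of any one particular subset of size r
        prob = p ** r * (1 - p) ** (n - r)
        for subset in combinations(values, r):
            total += prob * (sum(subset) - center) ** 2
    return total

def lemma_bound(p, inv_nx, t, w_t):
    """Right-hand side p_t / n_x + p_t^2 (t - W(t))^2."""
    return p * inv_nx + p ** 2 * (t - w_t) ** 2

t, b = 0.2, 0.3
below = [0.15, 0.10]              # weights with U_i <= t, so W(t) = 0.25
above = [0.30, 0.20, 0.15, 0.10]  # weights with U_i > t, summing to 1 - W(t)
w_t = sum(below)
inv_nx = sum(w * w for w in below + above)   # assumed n_x^{-1}
p_t = (b - t) / (1 - t)

lhs = exact_second_moment(above, p_t, b - t)
rhs = lemma_bound(p_t, inv_nx, t, w_t)
print(lhs, rhs, lhs <= rhs)
```

In this example the left side equals the conditional variance $p_t q_t \sum W_i^2$ over the remaining weights plus the squared offset $p_t^2 (t - W(t))^2$, which makes the slack in the lemma visible.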

The bias term is

$$\sqrt{n_x} \sum_i W_i \left( 1_{Y_i \le t} - 1_{Y_i^* \le t} \right).$$

To make it converge to zero, it is necessary to have each $Y_i$ close to $Y_i^*$, or to have the
corresponding $W_i$ close to zero. The worst case arises when $F_x$ is a stochastic relative
extremum of the $F_{X_i}$. Then all of the $1_{Y_i \le t} - 1_{Y_i^* \le t}$ are of the same sign. Picture a sum of
boxcar functions with height $\sqrt{n_x}\,W_i$ and endpoints $Y_i$ and $Y_i^*$. The wide ones tend to be
short and the tall ones thin. This will allow pointwise convergence of the sum to zero. For
uniform convergence there is a further subtlety. The $Y_i^*$ endpoints are i.i.d. uniform[0,1],
and so they are spread out over the interval. But the $Y_i$ endpoints are not uniform and
they can pile up in arbitrarily small intervals. Since the boxcar functions with the largest
weights have $x_i$ close to $x$, they also have $Y_i$ close to $Y_i^*$ and so their $Y_i$ are well spread
out. The ones that might pile up have smaller weight.

So that $X_i$ close to $x$ implies $Y_i^*$ close to $Y_i$, we impose a condition on $V_\infty(F_{X_i}, F_x)$.
Sufficient conditions for that condition will be given later in Lemmas 4.4.4 and 4.4.5.

We also will need a condition to cause the bias term to win the race to zero. The
proof of the next theorem employs a truncation of observations for which $\|X_i - x\| > \Delta_n$.
The sequence $\Delta_n$ has to be small enough to impose good behavior on the truncated term.
Then $W_x$ has to approach $\delta_x$ fast enough that the truncation has a negligible impact. If
one takes $\Delta_n = 1/(\sqrt{n_x}\,\log n)$ then it will be necessary for $n_x \log n\, V_1(W_x, \delta_x) \to 0$ in pr.
For $k$-NN this means that the average distance from $x$ must be somewhat smaller than
$1/k$.
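The $k$-NN remark can be illustrated with a small computation. The sketch assumes that $V_1(W_x, \delta_x)$ is the weighted mean distance $\sum_i W_i \|X_i - x\|$ (the first-order Vasserstein distance to a point mass), that $n_x = 1/\sum_i W_i^2 = k$ for equal $k$-NN weights, and a hypothetical evenly spaced design on [0,1]. For that design the average distance to the $k$ nearest neighbors of an interior point is $k/(4n)$, so the statistic is $k^2 \log n/(4n)$.

```python
import math

def knn_rate_statistic(xs, x, k):
    """n_x * log(n) * V_1(W_x, delta_x) for k-NN weights W_i = 1/k."""
    n = len(xs)
    dists = sorted(abs(xi - x) for xi in xs)[:k]  # k nearest distances
    n_x = k                  # 1 / (k * (1/k)^2)
    v1 = sum(dists) / k      # weighted mean distance to x
    return n_x * math.log(n) * v1

n = 10000
xs = [(i + 0.5) / n for i in range(n)]   # hypothetical regular design
for k in (20, 100, 500):
    print(k, knn_rate_statistic(xs, 0.5, k), k * k * math.log(n) / (4 * n))
```

In this design the statistic is small only when $k$ is well below $\sqrt{n/\log n}$, which is one way to read "somewhat smaller than $1/k$".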
Theorem 4.4.4 Suppose that $F_x$ is uniform[0,1], that for some $D > 0$

$$V_\infty(F_{X_i}, F_x) \le M_x \|X_i - x\| \quad \text{whenever } \|X_i - x\| \le D, \qquad (2)$$

that the probability weights $W_i$ satisfy (4.2.8ab) and that there exists a sequence

$$\Delta = \Delta_n(X_1, \ldots, X_n)$$

such that

$$\sqrt{n_x}\,\Delta \to 0 \text{ in pr.} \quad \text{and} \quad \sqrt{n_x}\,\Delta^{-1} V_1(W_x, \delta_x) \to 0 \text{ in pr.}$$

Then

$$\sqrt{n_x}\,(\hat F_x - F_x) \Rightarrow B_0$$

where $B_0$ is the Brownian bridge.

PROOF. Let

$$Z_n(t) = \sqrt{n_x} \sum_i W_i \left( 1_{Y_i^* \le t} - t \right)$$

be the variance process and

$$B_n(t) = \sqrt{n_x} \sum_i W_i \left( 1_{Y_i \le t} - 1_{Y_i^* \le t} \right)$$

be the bias term. Under the conditions above the variance term converges to the Brownian
bridge. It remains to show that the supremum of the absolute value of the bias process
converges in probability to 0. This is done by constructing a bounding process that has
the desired convergence.

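To construct the bounding process, recall that

$$V_\infty(F_{X_i}, F_x) = \sup_{0 < u < 1} \left| F_{X_i}^{-1}(u) - F_x^{-1}(u) \right| \ge |Y_i - Y_i^*|$$

from which

$$\left| 1_{Y_i \le t} - 1_{Y_i^* \le t} \right| \le 1_{\,t - V_\infty(F_{X_i}, F_x) < Y_i^* \le t + V_\infty(F_{X_i}, F_x)}.$$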
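Then

$$\begin{aligned}
|B_n(t)| &\le \sqrt{n_x} \sum_{\|X_i - x\| \le \Delta} W_i\, 1_{\,t - V_\infty(F_{X_i}, F_x) < Y_i^* \le t + V_\infty(F_{X_i}, F_x)}
 + \sqrt{n_x} \sum_{\|X_i - x\| > \Delta} W_i \\
&\le \sqrt{n_x} \sum_{\|X_i - x\| \le \Delta} W_i\, 1_{\,t - \Delta M_x < Y_i^* \le t + \Delta M_x}
 + \sqrt{n_x}\, V_1(W_x, \delta_x)/\Delta
\end{aligned}$$

so long as $\Delta \le D$. Since $P(\Delta > D) \to 0$ and $\sqrt{n_x}\,\Delta^{-1} V_1(W_x, \delta_x) \to 0$ in pr., it suffices to
show that

$$G_n(t) = \sqrt{n_x} \sum_i W_i\, 1_{\,t - \Delta M_x < Y_i^* \le t + \Delta M_x}$$

converges uniformly to 0 in probability. At a fixed value of $t$

$$\begin{aligned}
P(|G_n(t)| > \epsilon) &\le \epsilon + P\left( E(|G_n(t)| \mid X) > \epsilon^2 \right) \\
&\le \epsilon + P\Big( \sqrt{n_x} \sum_i W_i\, 2\Delta M_x > \epsilon^2 \Big) \\
&\le \epsilon + P\left( 2\sqrt{n_x}\,\Delta M_x > \epsilon^2 \right) \\
&\to \epsilon.
\end{aligned}$$

Therefore the finite dimensional limiting distributions of the bounding process $G_n$ are all
degenerate at 0. It remains to show that $G_n$ is nearly tight, and that is accomplished by
using the near tightness of the variance process.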

Notice that the range of $Y_i$ might be larger than that of $Y_i^*$. However, in this case
the bias process outside the range assumes its maximum at 0 or 1. It follows that near
tightness need only be shown in the interval [0, 1].

Pick $\epsilon > 0$ and $\delta > 0$. Pick $m$ so that

$$\limsup P\Big\{ \max_{0 \le i \le m-1}\ \sup_{t \in [i/m,\,(i+1)/m)} |Z_n(t) - Z_n(t_i)| > \delta \Big\} < \epsilon.$$

Such an $m$ was constructed in the proof of Theorem 4.4.3. By symmetry it follows that

$$\limsup P\Big\{ \max_{1 \le i \le m}\ \sup_{t \in ((i-1)/m,\,i/m]} |Z_n(t) - Z_n(t_i)| > \delta \Big\} < \epsilon.$$
Now for $t_i = i/m$,

$$G_n(t) = Z_n(t + \Delta M_x) - Z_n(t - \Delta M_x) + 2\sqrt{n_x}\,\Delta M_x$$

so that

$$\begin{aligned}
G_n(t) - G_n(t_i) &= Z_n(t + \Delta M_x) - Z_n(t - \Delta M_x) - Z_n(t_i + \Delta M_x) + Z_n(t_i - \Delta M_x) \\
&= Z_n(t + \Delta M_x) - Z_n(t_{i+1}) \\
&\quad + Z_n(t_{i+1}) - Z_n(t_i + \Delta M_x) \\
&\quad + Z_n(t_i) - Z_n(t - \Delta M_x) \\
&\quad + Z_n(t_i - \Delta M_x) - Z_n(t_i).
\end{aligned}$$

Since $P(\Delta M_x > 1/m) \to 0$, it may be assumed that $\Delta M_x \le 1/m$, so that $t \in [t_i, t_{i+1})$
implies that either $t + \Delta M_x \in [t_i, t_{i+1})$ or $t + \Delta M_x \in [t_{i+1}, t_{i+2})$. Similarly there are two
intervals that might possibly contain $t - \Delta M_x$. Using elementary bounds

$$\limsup P\Big\{ \max_{0 \le i < m}\ \sup_{t \in [i/m,\,(i+1)/m)} |G_n(t) - G_n(t_i)| > 6\delta \Big\} < 6\epsilon.$$

This completes the proof. ■
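The step that rewrites the bounding process in terms of the variance process is a purely algebraic identity, $G_n(t) = Z_n(t + a) - Z_n(t - a) + 2\sqrt{n_x}\,a$ with $a = \Delta M_x$, and can be checked numerically. The sketch assumes $n_x = 1/\sum_i W_i^2$ and uses arbitrary simulated weights and uniform $Y_i^*$; all names are illustrative.

```python
import math
import random

def root_nx(weights):
    """sqrt(n_x) under the assumed definition n_x = 1 / sum(W_i^2)."""
    return math.sqrt(1.0 / sum(w * w for w in weights))

def z_n(weights, y_star, s):
    """Variance process: sqrt(n_x) * (sum_i W_i 1{Y_i^* <= s} - s)."""
    mass = sum(w for w, y in zip(weights, y_star) if y <= s)
    return root_nx(weights) * (mass - s)

def g_n(weights, y_star, t, a):
    """Bounding process: sqrt(n_x) * sum_i W_i 1{t - a < Y_i^* <= t + a}."""
    mass = sum(w for w, y in zip(weights, y_star) if t - a < y <= t + a)
    return root_nx(weights) * mass

rng = random.Random(1)
raw = [rng.random() for _ in range(50)]
weights = [r / sum(raw) for r in raw]          # probability weights
y_star = [rng.random() for _ in range(50)]     # stand-ins for uniform Y_i^*
t, a = 0.4, 0.05
lhs = g_n(weights, y_star, t, a)
rhs = z_n(weights, y_star, t + a) - z_n(weights, y_star, t - a) + 2 * root_nx(weights) * a
print(lhs, rhs)
```

The deterministic term $2\sqrt{n_x}\,a$ is the one controlled by the condition $\sqrt{n_x}\,\Delta \to 0$; the remaining terms are increments of $Z_n$ over short intervals.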

Theorem 4.4.4 shows that very general weighting schemes are capable of providing
asymptotically Brownian estimates of the uniform distribution. The conclusion is applicable
so long as the distributions of the random variables $F_x(Y_i)$ satisfy the condition
above. This does not follow from a similar condition applied to the distributions $F_{X_i}$ of the
$Y_i$ themselves. Sufficient conditions are provided by the next lemma:

Lemma 4.4.4 Suppose that $F_x$ admits a density that is bounded above by $B < \infty$, and
that the $F_{x_i}$ satisfy $V_\infty(F_{x_i}, F_x) \le A_x \|x_i - x\|$. Then for some $M_x$,

$$V_\infty\left( \mathcal{L}(F_x(Y) \mid X = x_i),\ \mathcal{L}(F_x(Y) \mid X = x) \right) \le M_x \|x_i - x\|.$$
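PROOF.

$$\begin{aligned}
V_\infty\left( \mathcal{L}(F_x(Y) \mid X = x_i),\ \mathcal{L}(F_x(Y) \mid X = x) \right)
&= \sup_{0 < u < 1} \left| F_x\left( F_{x_i}^{-1}(u) \right) - u \right| \\
&\le B \sup_{0 < u < 1} \left| F_{x_i}^{-1}(u) - F_x^{-1}(u) \right| \\
&\le B\,V_\infty(F_{x_i}, F_x) \\
&\le B A_x \|x_i - x\|
\end{aligned}$$

so we may take $M_x = B A_x$. ■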
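As an illustration of Lemma 4.4.4, consider a hypothetical normal location family $F_x = N(x, 1)$, which is not taken from the text. Here $F_{x_i}^{-1}(u) - F_x^{-1}(u) = x_i - x$, so $A_x = 1$, and the density is bounded by $B = 1/\sqrt{2\pi}$. The sketch evaluates $\sup_u |F_x(F_{x_i}^{-1}(u)) - u|$ on a grid of $u$ values and compares it with $B A_x |x_i - x|$.

```python
from statistics import NormalDist

def transformed_distance(x_i, x, grid=2000):
    """Grid approximation to sup_u |F_x(F_{x_i}^{-1}(u)) - u| for F_x = N(x, 1)."""
    f_x, f_xi = NormalDist(x, 1.0), NormalDist(x_i, 1.0)
    us = [(j + 0.5) / grid for j in range(grid)]   # interior points of (0, 1)
    return max(abs(f_x.cdf(f_xi.inv_cdf(u)) - u) for u in us)

B = NormalDist().pdf(0.0)     # density bound 1 / sqrt(2*pi)
A_x = 1.0                     # V_inf(F_{x_i}, F_x) = |x_i - x| for a location shift
for shift in (0.05, 0.3, 1.0):
    print(shift, transformed_distance(shift, 0.0), B * A_x * shift)
```

For small shifts the two columns nearly agree, so the constant $M_x = B A_x$ cannot be improved much in this family.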

A restriction to distributions with bounded density is unpalatable, since it rules out
such distributions as the exponential. Large densities allow $F_x(y)$ to be very different from
$F_{x_i}(y_i)$ even when $y$ is close to $y_i$. It is often reasonable to suppose that when $F_x$ has a
large density, $F_{x_i}$ does too when $x_i$ is close to $x$. This motivates the next lemma.

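Lemma 4.4.5 Suppose that the $F_{x_i}$ satisfy

$$KS(F_{x_i}, F_x) \le M_x \|x_i - x\|.$$

Then

$$V_\infty\left( \mathcal{L}(F_x(Y) \mid X = x_i),\ \mathcal{L}(F_x(Y) \mid X = x) \right) \le M_x \|x_i - x\|.$$

PROOF. Let $U$ be a uniform[0,1] random variable.

$$\begin{aligned}
V_\infty\left( \mathcal{L}(F_x(Y) \mid X = x_i),\ \mathcal{L}(F_x(Y) \mid X = x) \right)
&= \sup_{0 < u < 1} \left| F_x\left( F_{x_i}^{-1}(u) \right) - u \right| \\
&= \sup_{0 < u < 1} \left| F_x\left( F_{x_i}^{-1}(u) \right) - F_{x_i}\left( F_{x_i}^{-1}(u) \right) \right| \\
&\le KS(F_x, F_{x_i}) \\
&\le M_x \|x - x_i\|. \qquad ■
\end{aligned}$$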


Theorem 4.4.5 Suppose that $F_x$ is absolutely continuous, that the $W_i$ are probability
weights satisfying (4.2.8ab), that there exists a sequence

$$\Delta = \Delta_n(X_1, \ldots, X_n)$$

such that

$$\sqrt{n_x}\,\Delta \to 0 \text{ in pr.} \quad \text{and} \quad \sqrt{n_x}\,\Delta^{-1} V_1(W_x, \delta_x) \to 0 \text{ in pr.}$$

and that for $\|x_i - x\| \le D \in (0, \infty)$, the $F_{x_i}$ satisfy either

$$KS(F_{x_i}, F_x) \le M_x \|x_i - x\|$$

or

$$V_\infty(F_{x_i}, F_x) \le M_x \|x_i - x\| \quad \text{and} \quad \sup_y \frac{d}{dy} F_x(y) \le K < \infty.$$

Then

$$\sqrt{n_x}\,(\hat F_x - F_x) \Rightarrow B$$

where $B$ is a continuous Gaussian process with mean 0 and, for $s \le t$,

$$\mathrm{Cov}(B(s), B(t)) = F_x(s)\,(1 - F_x(t)).$$

PROOF. Apply Lemmas 4.4.4 and 4.4.5 and Theorem 4.4.4. ■

Stute (1986) obtains a Brownian limit for symmetric nearest neighbor estimates with
a bounded kernel function. (See Sec. 2.2 for a kernel based definition of symmetric nearest
neighbor methods.) His results are obtained for multivariate Y and univariate X. For the
variance term, Stute assumes that

$$\sup_{|t - s| \le \delta} \left| F_{x'}(t) - F_{x'}(s) \right| = o\left( (\log \delta^{-1})^{-1} \right)$$

as $\delta \to 0$ uniformly for $x'$ in a neighborhood of $x$. Stute remarks that this implies
equicontinuity of $F_{x'}(y)$, which is referred to as KS continuity here.

4.5  Asymptotic  Normality  of  Running  Functionals 

In this section we apply the results of the earlier sections and the theory of compact
differentiability to consider asymptotic normality for a class of statistical functionals. For
a brief summary of compact differentiability and von Mises' method see Sec. 2.5. For a
complete exposition see Fernholz (1983) or Reeds (1976).

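Suppose that the statistical functional $T$ has a compact derivative $T'_{F_x}$ at $F_x$. Then

$$\begin{aligned}
\sqrt{n_x}\left( T(\hat F_x) - T(F_x) \right)
&= \sqrt{n_x}\, T'_{F_x}(\hat F_x - F_x) + \sqrt{n_x}\, \mathrm{Rem}(\hat F_x - F_x) \\
&= \sqrt{n_x} \sum_i W_i\, IC(Y_i; F_x, T) + \sqrt{n_x}\, \mathrm{Rem}(\hat F_x - F_x)
\end{aligned}$$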
so if the random variables $V_i = IC(Y_i; F_x, T)$ and the weights $W_i$ satisfy the conditions of
Theorem 4.3.1 then the lead term is asymptotically normal. If also the remainder term
converges to 0 in probability, then $\sqrt{n_x}\,(T(\hat F_x) - T(F_x))$ is asymptotically normal. For
each functional, good behavior of the lead term implies a regularity condition on the $F_{X_i}$. We
establish asymptotic negligibility of the remainder term.
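A small worked case may help fix ideas. For the variance functional $T(F) = \mathrm{Var}_F(Y)$, the influence curve is the standard $IC(y; F, T) = (y - \mu)^2 - \sigma^2$, and for a weighted empirical distribution with $\sum_i W_i = 1$ the remainder is exactly $-(\hat\mu - \mu)^2$. This example is not from the text; the sketch below, with hypothetical names and simulated data, checks the decomposition into lead term plus remainder.

```python
import random

def variance_expansion(ys, ws, mu, sigma2):
    """Return (lead term, remainder, weighted mean) for T(F) = Var_F(Y)."""
    mu_hat = sum(w * y for w, y in zip(ws, ys))
    t_hat = sum(w * (y - mu_hat) ** 2 for w, y in zip(ws, ys))   # T(F_hat)
    lead = sum(w * ((y - mu) ** 2 - sigma2) for w, y in zip(ws, ys))
    rem = (t_hat - sigma2) - lead                                 # what is left
    return lead, rem, mu_hat

rng = random.Random(2)
ys = [rng.gauss(1.0, 2.0) for _ in range(200)]   # true mean 1, variance 4
raw = [rng.random() for _ in range(200)]
ws = [r / sum(raw) for r in raw]                 # unequal probability weights
lead, rem, mu_hat = variance_expansion(ys, ws, mu=1.0, sigma2=4.0)
print(lead, rem, -(mu_hat - 1.0) ** 2)
```

The remainder is quadratic in $\hat\mu - \mu$, so after multiplication by $\sqrt{n_x}$ it is of smaller order than the lead term whenever $\sqrt{n_x}\,(\hat\mu - \mu)$ is bounded in probability.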

Following Fernholz (1983, Ch. 4), we assume that $F_x$ is U[0,1] and that the statistical
functional is defined on $D[0,1]$. This is only a slight loss of generality. A statistical
functional $T$ induces a functional $\tau$ on $D[0,1]$ by $\tau(G) = T(G \circ F_x)$. So long as $F_x$
is increasing, any distribution function can be expressed as $G \circ F_x$ for some $G$. The
asymptotic negligibility of the remainder term will be established by an argument that
parallels that of Fernholz (1983, Secs. 4.1-4.3), which is in turn based on a method of Reeds
(1976, Sec. 6.5). There are two important differences. Since $\hat F_x$ is measurable in this
treatment, there is no need to appeal to inner or outer measures. More seriously, the
unequal weighting of observations in $\hat F_x$ adds complication. It will be necessary to assume
that the weighting is not too unequal.

We use the following lemma from Fernholz (1983). The distance between $H \in D[0,1]$
and $K \subset D[0,1]$ will be taken to be $\mathrm{dist}(H, K) = \inf_{G \in K} \|H - G\|$.

Lemma 4.5.1 Let $Q : D[0,1] \times \mathbb{R} \to \mathbb{R}$ and suppose for any compact set $K \subset D[0,1]$

$$\lim_{t \to 0} Q(H, t) = 0$$

uniformly in $H \in K$. Let $\epsilon > 0$ and let $\delta_n \downarrow 0$ be a sequence of numbers. Then for any
compact $K \subset D[0,1]$, there exists $n_0$ such that for $n > n_0$ and $\mathrm{dist}(H, K) \le \delta_n$,

$$|Q(H, \delta_n)| < \epsilon.$$

PROOF. See Fernholz (1983, Lemma 4.3.1).

To apply Lemma 4.5.1 to a sequence with $\delta_n \to 0$ in pr., the following version is of
more direct use.

Corollary Let $Q$ be as in Lemma 4.5.1 and let $\epsilon > 0$. Then for any compact $K \subset D[0,1]$
there exists $\eta > 0$ such that $\delta \le \eta$ and $\mathrm{dist}(H, K) \le \delta$ implies

$$|Q(H, \delta)| < \epsilon.$$

PROOF. Suppose not. Then there is a compact set $K$ and infinite sequences $\eta_i \downarrow 0$
and $H_i$ such that $\mathrm{dist}(H_i, K) \le \eta_i$ and $|Q(H_i, \eta_i)| \ge \epsilon$. But this contradicts Lemma 4.5.1. ■

The next lemma is used in the proof of the convergence of the remainder term. It is of
some interest in its own right since it has weaker conditions on the weights than Theorem
4.5.1. Introduce the process $\tilde F_x$ for which

$$\tilde F_x(Y_i) = \hat F_x(Y_i), \quad \tilde F_x(0) = 0, \quad \tilde F_x(1) = 1$$

and $\tilde F_x$ is piecewise linear over the $n + 1$ intervals between those points. Assume that the
$F_{X_i}$ are continuous distributions so that there are no ties. Then by construction

$$|\tilde F_x(y) - \hat F_x(y)| \le \max_{1 \le i \le n} |W_i|$$

for all $y \in [0, 1]$. For the rest of this section

$$W = \max_{1 \le i \le n} |W_i|.$$
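The construction of the piecewise linear version can be sketched directly. The code below assumes nonnegative weights summing to one and distinct $Y_i$ in the open interval (0, 1); the function names and the simulated data are illustrative. It builds both the step function $\hat F_x$ and its interpolant $\tilde F_x$, and checks that they never differ by more than the largest weight.

```python
import random
from bisect import bisect_right

def build_cdfs(ys, ws):
    """Return (F_hat, F_tilde) for mass ws[i] at ys[i], with ys in (0, 1)."""
    pairs = sorted(zip(ys, ws))
    sorted_ys = [y for y, _ in pairs]
    cum = [0.0]
    for _, w in pairs:
        cum.append(cum[-1] + w)          # cum[j] = total weight of the j smallest Y
    knots = [0.0] + sorted_ys + [1.0]    # n + 2 knots, n + 1 intervals
    heights = cum + [1.0]                # F_tilde at the knots: 0, c_1..c_n, 1

    def f_hat(y):
        return cum[bisect_right(sorted_ys, y)]

    def f_tilde(y):
        j = min(bisect_right(knots, y) - 1, len(knots) - 2)
        frac = (y - knots[j]) / (knots[j + 1] - knots[j])
        return heights[j] + frac * (heights[j + 1] - heights[j])

    return f_hat, f_tilde

rng = random.Random(3)
ys = [rng.uniform(0.01, 0.99) for _ in range(30)]
raw = [rng.random() for _ in range(30)]
ws = [r / sum(raw) for r in raw]
f_hat, f_tilde = build_cdfs(ys, ws)
gap = max(abs(f_tilde(j / 1000) - f_hat(j / 1000)) for j in range(1001))
print(gap, max(ws))
```

On each interval between consecutive order statistics the step function is constant while the interpolant climbs by at most one weight, which is why the gap is bounded by $W$.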

Lemma 4.5.2 Suppose that $T$ has compact derivative $T'_U$ at $U = F_x$, all the $F_{X_i}$ are
continuous and that

$$\sqrt{n_x}\,(\hat F_x - F_x) \Rightarrow B_0 \qquad (1)$$

where $B_0$ is the Brownian bridge. Then

$$\sqrt{n_x}\,\mathrm{Rem}(\tilde F_x - F_x) \to 0 \text{ in pr.} \qquad (2)$$

Remarks Sufficient conditions for (1) are given in Theorem 4.4.4. See also Theorem 4.4.5.
Note that (1) implies $n_x \to \infty$ in pr. and $\sqrt{n_x}\,W \to 0$ in pr.
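PROOF. Let $\epsilon > 0$. The process $\sqrt{n_x}\,(\tilde F_x - F_x)$ is within $\sqrt{n_x}\,W$ of $\sqrt{n_x}\,(\hat F_x - F_x) \Rightarrow B_0$.
Since $\sqrt{n_x}\,W \to 0$ in pr. it follows that $\sqrt{n_x}\,(\tilde F_x - F_x) \Rightarrow B_0$. Moreover since $\sqrt{n_x}\,(\tilde F_x - F_x) \in
C[0,1]$, a separable metric space, there is by Prohorov's theorem a compact set $K \subset C[0,1]$
such that

$$P\left( \sqrt{n_x}\,(\tilde F_x - F_x) \in K \right) > 1 - \epsilon.$$

$K$ is also compact in $D[0,1]$.

Because $T$ is compactly differentiable, for $H \in K$ and $n_x$ sufficiently large (greater
than $n^*$ say)

$$\left| \sqrt{n_x}\,\mathrm{Rem}\left( n_x^{-1/2} H \right) \right| < \epsilon.$$

Therefore

$$P\left( \left| \sqrt{n_x}\,\mathrm{Rem}(\tilde F_x - F_x) \right| > \epsilon \right) \le \epsilon + P(n_x < n^*) \to \epsilon,$$

and so $\sqrt{n_x}\,\mathrm{Rem}(\tilde F_x - F_x) \to 0$ in pr. ■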

We see also that if $T$ has a Fréchet derivative then $\sqrt{n_x}\,\mathrm{Rem}(\hat F_x - F_x) \to 0$ in pr. under
the conditions of Lemma 4.5.2. This is because the set $\{H : \mathrm{dist}(H, K) \le \epsilon\}$ is bounded
for compact $K$ and the remainder term converges to zero uniformly over bounded sets
under Fréchet differentiability.

Theorem 4.5.1 Suppose that $T$ has compact derivative $T'_U$ at $U = F_x$, all the $F_{X_i}$ are
continuous, that (2) holds and that $n_x W = O_p(1)$. Then $\sqrt{n_x}\,\mathrm{Rem}(\hat F_x - F_x) \to 0$ in pr.

Remark The condition that $n_x W = O_p(1)$ is not too restrictive. A "fair share" for a point
would be $1/n_x$ and the condition bounds the multiple of that amount that any point can
receive. Also note that, by consideration of the finite dimensional distribution functions,
(2) implies (4.2.8abc), and in particular that $n_x \to \infty$ in pr.

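PROOF. Let $\epsilon > 0$. Choose $B < \infty$ so that $\limsup P(n_x W > B) < \epsilon$. By the
argument of Lemma 4.5.2 there is a compact set $K \subset D[0,1]$ such that

$$P\left( \sqrt{n_x}\,(\tilde F_x - F_x) \in K \right) > 1 - \epsilon$$

and since

$$\|\tilde F_x - \hat F_x\| \le W$$

by construction,

$$P\left( \mathrm{dist}\left( \sqrt{n_x}\,(\hat F_x - F_x),\ K \right) > \sqrt{n_x}\,W \right) < \epsilon.$$

By the compact differentiability of $T$ at $U$, the function

$$Q(H, t) = \frac{\mathrm{Rem}(B^{-1} H t)}{B^{-1} t}$$

satisfies the conditions of Lemma 4.5.1. By the Corollary to Lemma 4.5.1 there exists
$\eta > 0$ such that $\delta \le \eta$ and $\mathrm{dist}(H, K) \le \delta$ imply $|Q(H, \delta)| < \epsilon$.

Therefore

$$\begin{aligned}
P\left( \left| \sqrt{n_x}\,\mathrm{Rem}(\hat F_x - F_x) \right| > \epsilon \right)
&= P\left( \left| Q\left( \sqrt{n_x}\,(\hat F_x - F_x),\ B/\sqrt{n_x} \right) \right| > \epsilon \right) \\
&\le P\left( B/\sqrt{n_x} > \eta \right) + P\left( \mathrm{dist}\left( \sqrt{n_x}\,(\hat F_x - F_x),\ K \right) > B/\sqrt{n_x} \right) \\
&\le P\left( B/\sqrt{n_x} > \eta \right) + \epsilon + P\left( \sqrt{n_x}\,W > B/\sqrt{n_x} \right) \\
&= P\left( B/\sqrt{n_x} > \eta \right) + \epsilon + P(n_x W > B).
\end{aligned}$$

Since $n_x \to \infty$ in pr., the first term tends to 0, so that

$$\limsup P\left( \left| \sqrt{n_x}\,\mathrm{Rem}(\hat F_x - F_x) \right| > \epsilon \right) \le 2\epsilon. \qquad ■$$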


Which statistical functionals induce functionals on $D[0,1]$ that are compactly
differentiable at $U$? Fernholz (1983) establishes such compact differentiability for M estimates
with continuous piecewise differentiable $\psi$, such that $\psi'$ is bounded and vanishes off a
bounded interval, when $F$ has a piecewise continuous positive density. She also
establishes compact differentiability for some L estimates:

$$\int_0^1 h\left( F^{-1}(u) \right) M(u)\,du$$

provided $h$ is continuous and piecewise differentiable with bounded derivative (usually one
takes $h(y) = y$) and $M \in L^1[a, 1 - a]$ for some $a \in (0, 1/2]$, and $F$ has a positive density.
Similar regularity conditions on R estimators make them compactly differentiable.
Quantiles get special treatment. She shows that they induce compactly differentiable functionals
on $C[0,1]$ when $F$ is well behaved near the quantile in question. The asymptotic
negligibility of the remainder term for quantiles then follows by considering continuous versions
of the empirical distribution function that are constructed to agree with the empirical at
the quantile.
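As a concrete instance of an L estimate applied to a weighted empirical distribution, the sketch below computes a running trimmed mean: $h(y) = y$ and $M(u) = 1_{a \le u \le 1-a}/(1 - 2a)$, with the quantile function taken as the usual left-continuous inverse of the step function $\hat F_x$. The choice of $M$, the function name, and the requirement $a < 1/2$ are assumptions of this example rather than anything prescribed in the text.

```python
def weighted_trimmed_mean(ys, ws, a):
    """(1 - 2a)^{-1} * integral over [a, 1-a] of F_hat^{-1}(u) du, where
    F_hat puts mass ws[i] on ys[i]; requires 0 <= a < 1/2 and sum(ws) == 1."""
    lo, hi = a, 1.0 - a
    total, c_prev = 0.0, 0.0
    for y, w in sorted(zip(ys, ws)):
        c_next = c_prev + w
        # F_hat^{-1}(u) = y for u in (c_prev, c_next]; keep the part inside [lo, hi]
        overlap = min(c_next, hi) - max(c_prev, lo)
        if overlap > 0:
            total += y * overlap
        c_prev = c_next
    return total / (hi - lo)

print(weighted_trimmed_mean([1.0, 2.0, 3.0, 4.0], [0.25] * 4, 0.25))
print(weighted_trimmed_mean([0.0, 10.0], [0.75, 0.25], 0.25))
```

With unequal weights the trimming removes probability mass rather than a fixed number of observations, so a heavily weighted point can survive trimming even when it is the smallest value.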